Novel Human Alpha Macroglobulin Family Proteins and Polynucleotides Encoding the Same

characterize the protein. A starting material that can only be used to produce 
a final product does not have a substantial asserted utility in those instances 
where the final product is not supported by a specific and substantial utility. 
In this case none of the proteins that are to be produced as final products 
resulting from processes involving the claimed cDNA have asserted or 
identified specific and substantial utilities. The research contemplated by 
Applicants to characterize potential protein products, especially their 
biological activities, does not constitute a specific and substantial utility. 
Identifying and studying the properties of the protein itself or the 
mechanisms in which the protein is involved does not define a "real world" 
context of use. Note, because the claimed invention is not supported by a 
specific and substantial asserted utility for the reasons set forth above, 
credibility has not been assessed. Neither the specification as filed nor any 
art of record discloses or suggests any property or activity for the cDNA 
compounds such that another non-asserted utility would be well established 
for the compounds. 

Claim 1 is also rejected under 35 U.S.C. § 112, first paragraph. 
Specifically, since the claimed invention is not supported by either a specific 
and substantial asserted utility or a well established utility for the reasons set 
forth above, one skilled in the art would not know how to use the claimed 
invention. 

Example 10: DNA Fragment encoding a Full Open Reading Frame 
(ORF) 

Specification: The specification discloses that a cDNA library was prepared 
from human kidney epithelial cells and 5000 members of this library were 
App. Serial # 10/020,095 Exhibit L 

Walke et al. LEX-0282-USA 

THE HUMAN 
GENOME 

Humanity has been given a great gift. With the completion of the human 
genome sequence, we have received a powerful tool for unlocking the 
secrets of our genetic heritage and for finding our place among the other 
participants in the adventure of life. 

This week's issue of Science contains the report of the sequencing of 
the human genome from a group of authors led by Craig Venter of Celera 
Genomics. The report of the sequencing of the human genome from the 
publicly funded consortium of laboratories led by Francis Collins appears 
in this week's Nature. This stunning achievement has been portrayed, 
often unfairly, as a competition between two 
ventures, one public and one private. That [...] 

the awesome accomplishment jointly unveiled this week. In truth, each 
project contributed to the other. The inspired vision that launched the 
publicly funded project roughly 10 years ago [...] 
the confidence of those who believe that the pursuit of large-scale 
fundamental problems in the life sciences is in the national interest. The [...] 
innovation and drive of Craig Venter and his colleagues made it possible 
to celebrate this accomplishment far sooner than was [...] 
Thus we can salute what has become, in the end, not a contest but a 
marriage (perhaps encouraged by shotgun) between public funding and [...] 

[...] is invaluable. Indeed, a real-world proof of the importance of a [...] 

[...] announcement falls during the week of [...] 

message that the survival of a species can depend on [...] the Celera data. 

peculiarly pertinent to discussions that have gone on in the past [...] can be [...] 
Full information regarding the agreements that [...] 

[...] at www.sciencemag.org/[...] to all the [...] 

allowing data repositories other than [...] 
data needed to verify conclusions. In this domain change [...] 

satisfies our continuing commitment to full access. [...] carefully watched, has created 

It should be no surprise that an achievement so stunning and so [...] 

new challenges for [...] 

discovery onto the public stage. It is [...] Rather it is a library, in which, with 
endeavor. The human genome has been called the Book of Life [...] that will 

rules that encourage exploration and reward creativity, we can find many of the books [...] 
help define us and our place in the great tapestry of life. Barbara R. Jasny and Donald Kennedy 



sequenced and open reading frames were identified. The specification 
discloses a Table that indicates that one member of the library having SEQ 
ID NO: 2 has a high level of homology to a DNA ligase. The specification 
teaches that this complete ORF (SEQ ID NO: 2) encodes SEQ ID NO: 3. 
An alignment of SEQ ID NO: 3 with known amino acid sequences of DNA 
ligases indicates that there is a high level of sequence conservation between 
the various known ligases. The overall level of sequence similarity between 
SEQ ID NO: 3 and the consensus sequence of the known DNA ligases that 
are presented in the specification reveals a similarity score of 95%. A search 
of the prior art confirms that SEQ ID NO: 2 has high homology to DNA 
Ligase encoding nucleic acids and that the next highest level of homology is 
to alpha-actin. However, the latter homology is only 50%. Based on the 
sequence homologies, the specification asserts that SEQ ID NO: 2 encodes a 
DNA ligase. 
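
The homology figures quoted above (95% against the ligase consensus, 50% against alpha-actin) are comparisons between aligned sequences. The example does not state how its scores were computed, so the following is only a minimal sketch of one common measure, percent identity over the columns of an existing alignment. The function name and the toy sequences are hypothetical and are not taken from the specification.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Both inputs must already be aligned (equal length, '-' marking gaps).
    A column containing a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Hypothetical toy alignment, used only to exercise the function.
toy_query = "MKTAYIAKQR"
toy_hit = "MKTAYLAKQR"
print(f"{percent_identity(toy_query, toy_hit):.0f}% identity")
```

A similarity score, as opposed to identity, would also credit conservative substitutions, which requires a substitution matrix and is outside this sketch.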

Claim 1: An isolated and purified nucleic acid comprising SEQ ID NO: 2. 

Analysis: The following analysis includes the questions that need to be 
asked according to the guidelines and the answers to those questions based 
on the above facts: 

1) Based on the record, is there a "well established utility" for the 
claimed invention? Based upon applicant's disclosure and the results of the 
PTO search, there is no reason to doubt the assertion that SEQ ID NO: 2 
encodes a DNA ligase. Further, DNA ligases have a well-established use in 
the molecular biology art based on this class of protein's ability to ligate 
DNA. Consequently the answer to the question is yes. 






Note that if there is a well-established utility already associated with the 
claimed invention, the utility need not be asserted in the specification as 
filed. In order to determine whether the claimed invention has a well- 
established utility the examiner must determine that the invention has a 
specific, substantial and credible utility that would have been readily 
apparent to one of skill in the art. In this case SEQ ID NO: 2 was shown to 
encode a DNA ligase that the artisan would have recognized as having a 
specific, substantial and credible utility based on its enzymatic activity. 

Thus, the conclusion reached from this analysis is that a 35 U.S.C. § 
101 rejection and a 35 U.S.C. § 112, first paragraph, utility rejection should 
not be made. 

Example 11: Animals with Uncharacterized Human Genes 

Specification: Kidney cells from a patient with Polycystic Kidney (PCK) 
Disease have been used to make a cDNA library. From this library 8000 
nucleotide "fragments" have been sequenced but not yet used to express 
proteins in a transformed host cell nor have they been characterized in any 
other way. The 50 longest fragments, SEQ ID NO: 1-50, respectively, have 
been used to make transgenic mice. None of the 50 lines of mice have 
developed Polycystic Kidney Disease to date. The asserted utility is the use 
of the mice to research human genes from diseased human kidneys. The 
disease is inheritable, but chromosomal loci have not yet been identified. 
Neither the absence nor the presence of a specific protein has been identified with 
the disease condition. 



>AF410459 ACCESSION:AF410459 NID: gi 19071208 gb AF410459.1 Homo 
sapiens CD109 (CD109) mRNA, complete cds 
Length = 5883 

Score = 2854 bits (7317), Expect = 0.0 

Identities = 1427/1445 (98%), Positives = 1428/1445 (98%), Gaps = 17/1445 (1%) 
Frame = +2 

Query: 1 MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTIGVELLEHCPSQVTVKA 60 

MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTIGVELLEHCPSQVTVKA 
Sbjct: 113 MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTIGVELLEHCPSQVTVKA 292 

Query: 61 ELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIYELRVTGRTQDEILFSNST 120 

ELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIYELRVTGRTQDEILFSNST 
Sbjct: 293 ELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIYELRVTGRTQDEILFSNST 472 

Query: 121 RLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPYKTSLNILIKDPKSNLIQQWL 180 

RLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPYKTSLNILIKDPKSNLIQQWL 
Sbjct: 473 RLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPYKTSLNILIKDPKSNLIQQWL 652 

Query: 181 SQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQSFQVSEYVLPKFEVTLQTPLYCS 240 

SQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQSFQVSEYVLPKFEVTLQTPLYCS 
Sbjct: 653 SQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQSFQVSEYVLPKFEVTLQTPLYCS 832 

Query: 241 MNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGKKKNITKTFKINGSANFSFNDEEMK 300 

MNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGKKKNITKTFKINGSANFSFNDEEMK 
Sbjct: 833 MNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGKKKNITKTFKINGSANFSFNDEEMK 1012 

Query: 301 NVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGISRNVSTNVFFKQHDYIIEFFDYTTVL 360 

NVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGISRNVSTNVFFKQHDYIIEFFDYTTVL 
Sbjct: 1013 NVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGISRNVSTNVFFKQHDYIIEFFDYTTVL 1192 

Query: 361 KPSLNFTATVKVTRADGNQLTLEERRNNVVITVTQRNYTEYWSGSNSGNQKMEAVQKINY 420 

KPSLNFTATVKVTRADGNQLTLEERRNNVVITVTQRNYTEYWSGSNSGNQKMEAVQKINY 
Sbjct: 1193 KPSLNFTATVKVTRADGNQLTLEERRNNVVITVTQRNYTEYWSGSNSGNQKMEAVQKINY 1372 

Query: 421 TVPQSGTFKIEFPILEDSSELQLKAYFLGSKSSMAVHSLFKSPSKTYIQLKTRDENIKVG 480 

TVPQSGTFKIEFPILEDSSELQLKAYFLGSKSSMAVHSLFKSPSKTYIQLKTRDENIKVG 
Sbjct: 1373 TVPQSGTFKIEFPILEDSSELQLKAYFLGSKSSMAVHSLFKSPSKTYIQLKTRDENIKVG 1552 

Query: 481 SPFELVVSGNKRLKELSYMVVSRGQLVAVGKQNSTMFSLTPENSWTPKACVIVYYIEDDG 540 

SPFELVVSGNKRLKELSYMVVSRGQLVAVGKQNSTMFSLTPENSWTPKACVIVYYIEDDG 
Sbjct: 1553 SPFELVVSGNKRLKELSYMVVSRGQLVAVGKQNSTMFSLTPENSWTPKACVIVYYIEDDG 1732 

Query: 541 EIISDVLKIPVQLVFKNKIKLYWSKVKAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMN 600 

EIISDVLKIPVQLVFKNKIKLYWSKVKAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMN 
Sbjct: 1733 EIISDVLKIPVQLVFKNKIKLYWSKVKAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMN 1912 

Query: 601 ASNDITMENVVHELELYNTGYYLGMFMNSFAVFQECGLWVLTDANLTKDYIDGVYDNAEY 660 

ASNDITMENVVHELELYNTGYYLGMFMNSFAVFQECGLWVLTDANLTKDYIDGVYDNAEY 
Sbjct: 1913 ASNDITMENVVHELELYNTGYYLGMFMNSFAVFQECGLWVLTDANLTKDYIDGVYDNAEY 2092 

Query: 661 AERFMEENEGHIVDIHDFSLGSSPHVRKHFPETWIWLDTNMGYRIYQEFEVTVPDSITSW 720 

AERFMEENEGHIVDIHDFSLGSSPHVRKHFPETWIWLDTNMGYRIYQEFEVTVPDSITSW 
Sbjct: 2093 AERFMEENEGHIVDIHDFSLGSSPHVRKHFPETWIWLDTNMGYRIYQEFEVTVPDSITSW 2272 
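
In the rows above, each block of 60 Query residues spans 180 Sbjct bases, starting with Query residue 1 at Sbjct base 113 and reported as Frame = +2. The arithmetic behind those coordinates can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the mapping is valid only for the ungapped stretch shown above (residues 1 to 720), because the alignment reports 17 gap positions further on.

```python
CDS_START = 113  # Sbjct base aligned to Query residue 1 in the rows above


def sbjct_codon_span(residue, cds_start=CDS_START):
    """Return (first_base, last_base) of the codon encoding a 1-based residue."""
    if residue < 1:
        raise ValueError("residue numbering starts at 1")
    first = cds_start + 3 * (residue - 1)
    return first, first + 2


def forward_frame(cds_start=CDS_START):
    """Forward reading frame (1, 2 or 3) of a codon that starts at cds_start."""
    return (cds_start - 1) % 3 + 1


# Row boundaries taken from the alignment: residues 1, 60, 61 and 720.
for residue in (1, 60, 61, 720):
    print(residue, sbjct_codon_span(residue))
print("frame", forward_frame())
```

Residue 60 ends at base 292 and residue 720 at base 2272, matching the Sbjct coordinates printed at the ends of the corresponding rows.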



Query:        Sbjct: 2273 
Query: 781    Sbjct: 2453 
Query: 841    Sbjct: 2633 
Query: 901    Sbjct: 2813 
Query: 961    Sbjct: 2993 
Query: 1021   Sbjct: 3173 
Query: 1081   Sbjct: 3353 
Query: 1141   Sbjct: 3533 
Query: 1201   Sbjct: 3713 
Query: 1244   Sbjct: 3893 
Query: 1304   Sbjct: 4073 
Query: 1364   Sbjct: 4253 
Query: 1424   Sbjct: 4433 

YQRELLYQREDGSFSAFGNYDPSGSTWLSAFVLRCFLEADPYIDIDQNVLHRTYTWLKGH 
LOCUS       AF410459     5883 bp    mRNA    linear 
DEFINITION  Homo sapiens CD109 (CD109) mRNA, complete cds. 
ACCESSION   AF410459 
VERSION     AF410459.1  GI:19071208 
SOURCE      Homo sapiens (human) 
  ORGANISM  Homo sapiens 
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE   1  (bases 1 to 5883) 
  AUTHORS   Lin,M., Sutherland,D.R., Horsfall,W., Totty,N., Yeo,E., Nayar,R., 
            Wu,X.-F. and Schuh,A.C. 
  TITLE     Cell surface antigen CD109 is a novel member of the alpha(2) 
            macroglobulin/C3, C4, C5 family of thioester-containing proteins 
  JOURNAL   Blood 99 (5), 1683-1691 (2002) 
  MEDLINE   21849742 
   PUBMED   11861284 
REFERENCE   2  (bases 1 to 5883) 
  AUTHORS   Lin,M., Sutherland,D.R., Horsfall,W., Totty,N., Yeo,E., Nayar,R., 
            Wu,X.-F. and Schuh,A.C. 
  TITLE     Direct Submission 
  JOURNAL   Submitted (14-AUG-2001) Medicine, University of Toronto, 1 King's 
            College Circle, Room 7366, Toronto, Ontario M5S 1A8, Canada 
FEATURES             Location/Qualifiers 
     source          1..5883 
                     /organism="Homo sapiens" 
                     /mol_type="mRNA" 
                     /db_xref="taxon:9606" 
                     /clone="K1" 
     gene            1..5883 
                     /gene="CD109" 
     5'UTR           1..112 
                     /gene="CD109" 
     CDS             113..4450 
                     /gene="CD109" 
                     /note="associated with the Gov alloantigen system" 
                     /codon_start=1 
                     /product="CD109" 
                     /protein_id="AAL84159.1" 
                     /db_xref="GI:19071209" 
                     /translation="MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTI 
                     GVELLEHCPSQVTVKAELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIY 
                     ELRVTGRTQDEILFSNSTRLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPY 
                     KTSLNILIKDPKSNLIQQWLSQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQS 
                     FQVSEYVLPKFEVTLQTPLYCSMNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGK 
                     KKNITKTFKINGSANFSFNDEEMKNVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGI 
                     SRNVSTNVFFKQHDYIIEFFDYTTVLKPSLNFTATVKVTRADGNQLTLEERRNNVVIT 
                     VTQRNYTEYWSGSNSGNQKMEAVQKINYTVPQSGTFKIEFPILEDSSELQLKAYFLGS 
                     KSSMAVHSLFKSPSKTYIQLKTRDENIKVGSPFELVVSGNKRLKELSYMVVSRGQLVA 



                     VGKQNSTMFSLTPENSWTPKACVIVYYIEDDGEIISDVLKIPVQLVFKNKIKLYWSKV 
                     KAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMNASNDITMENVVHELELYNTGYYLG 
                     MFMNSFAVFQECGLWVLTDANLTKDYIDGVYDNAEYAERFMEENEGHIVDIHDFSLGS 
                     SPHVRKHFPETWIWLDTNMGYRIYQEFEVTVPDSITSWVATGFVISEDLGLGLTTTPV 
                     ELQAFQPFFIFLNLPYSVIRGEEFALEITIFNYLKDATEVKVIIEKSDKFDILMTSNE 
                     INATGHQQTLLVPSEDGATVLFPIRPTHLGEIPITVTALSPTASDAVTQMILVKAEGI 
                     EKSYSQSILLDLTDNRLQSTLKTLSFSFPPNTVTGSERVQITAIGDVLGPSINGLASL 
                     IRMPYGCGEQNMINFAPNIYILDYLTKKKQLTDNLKEKALSFMRQGYQRELLYQREDG 
                     SFSAFGNYDPSGSTWLSAFVLRCFLEADPYIDIDQNVLHRTYTWLKGHQKSNGEFWDP 
                     GRVIHSELQGGNKSPVTLTAYIVTSLLGYRKYQPNIDVQESIHFLESEFSRGISDNYT 
                     LALITYALSSVGSPKAKEALNMLTWRAEQEGGMQFWVSSESKLSDSWQPRSLDIEVAA 
                     YALLSHFLQFQTSEGIPIMRWLSRQRNSLGGFASTQDTTVALKALSEFAALMNTERTN 
                     IQVTVTGPSSPSPVKFLIDTHNRLLLQTAELAVVQPMAVNISANGFGFAICQLNVVYN 
                     VKASGSSRRRRSIQNQEAFDLDVAVKENKDDLNHVDLNVCTSFSGPGRSGMALMEVNL 
                     LSGFMVPSEAISLSETVKKVEYDHGKLNLYLDSVNETQFCVNIPAVRNFKVSNTQDAS 
                     VSIVDYYEPRRQAVRSYNSEVKLSSCDLCSDVQGCRPCEDGASGSHHHSSVIFIFCFK 
                     LLYFMELWL" 
     3'UTR           4451..5883 
                     /gene="CD109" 
     misc_feature    4748 
                     /gene="CD109" 
                     /note="alternative polyA site found on clone K1" 

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=19071208 


ORIGIN 
        1 ctaaactcga attaagaggg aaaaaaaatc agggaggagg tggcaagcca caccccacgg
       61 tgcccgcgaa cttccccggc agcggactgt agcccaggca gacgccgtcg agatgcaggg
      121 cccaccgctc ctgaccgccg cccacctcct ctgcgtgtgc accgccgcgc tggccgtggc
      181 tcccgggcct cggtttctgg tgacagcccc agggatcatc aggcccggag gaaatgtgac
      241 tattggggtg gagcttctgg aacactgccc ttcacaggtg actgtgaagg cggagctgct
      301 caagacagca tcaaacctca ctgtctctgt cctggaagca gaaggagtct ttgaaaaagg
      361 ctcttttaag acacttactc ttccatcact acctctgaac agtgcagatg agatttatga
      421 gctacgtgta accggacgta cccaggatga gattttattc tctaatagta cccgcttatc
      481 atttgagacc aagagaatat ctgtcttcat tcaaacagac aaggccttat acaagccaaa
      541 gcaagaagtg aagtttcgca ttgttacact cttctcagat tttaagcctt acaaaacctc
      601 tttaaacatt ctcattaagg accccaaatc aaatttgatc caacagtggt tgtcacaaca
      661 aagtgatctt ggagtcattt ccaaaacttt tcagctatct tcccatccaa tacttggtga
      721 ctggtctatt caagttcaag tgaatgacca gacatattat caatcatttc aggtttcaga
      781 atatgtatta ccaaaatttg aagtgacttt gcagacacca ttatattgtt ctatgaattc
      841 taagcattta aatggtacca tcacggcaaa gtatacatat gggaagccag tgaaaggaga
      901 cgtaacgctt acatttttac ctttatcctt ttggggaaag aagaaaaata ttacaaaaac
      961 atttaagata aatggatctg caaacttctc ttttaatgat gaagagatga aaaatgtaat
     1021 ggattcttca aatggacttt ctgaatacct ggatctatct tcccctggac cagtagaaat
     1081 tttaaccaca gtgacagaat cagttacagg tatttcaaga aatgtaagca ctaatgtgtt
     1141 cttcaagcaa catgattaca tcattgagtt ttttgattat actactgtct tgaagccatc
     1201 tctcaacttc acagccactg tgaaggtaac tcgtgctgat ggcaaccaac tgactcttga
     1261 agaaagaaga aataatgtag tcataacagt gacacagaga aactatactg agtactggag
     1321 cggatctaac agtggaaatc agaaaatgga agctgttcag aaaataaatt atactgtccc
     1381 ccaaagtgga acttttaaga ttgaattccc aatcctggag gattccagtg agctacagtt
     1441 gaaggcctat ttccttggta gtaaaagtag catggcagtt catagtctgt ttaagtctcc
     1501 tagtaagaca tacatccaac taaaaacaag agatgaaaat ataaaggtgg gatcgccttt
     1561 tgagttggtg gttagtggca acaaacgatt gaaggagtta agctatatgg tagtatccag
     1621 gggacagttg gtggctgtag gaaaacaaaa ttcaacaatg ttctctttaa caccagaaaa
     1681 ttcttggact ccaaaagcct gtgtaattgt gtattatatt gaagatgatg gggaaattat
     1741 aagtgatgtt ctaaaaattc ctgttcagct tgtttttaaa aataagataa agctatattg
     1801 gagtaaagtg aaagctgaac catctgagaa agtctctctt aggatctctg tgacacagcc
     1861 tgactccata gttgggattg tagctgttga caaaagtgtg aatctgatga atgcctctaa
     1921 tgatattaca atggaaaatg tggtccatga gttggaactt tataacacag gatattattt
     1981 aggcatgttc atgaattctt ttgcagtctt tcaggaatgt ggactctggg tattgacaga
     2041 tgcaaacctc acgaaggatt atattgatgg tgtttatgac aatgcagaat atgctgagag
     2101 gtttatggag gaaaatgaag gacatattgt agatattcat gacttttctt tgggtagcag
     2161 tccacatgtc cgaaagcatt ttccagagac ttggatttgg ctagacacca acatgggtta
     2221 caggatttac caagaatttg aagtaactgt acctgattct atcacttctt gggtggctac
     2281 tggttttgtg atctctgagg acctgggtct tggactaaca actactccag tggagctcca
     2341 agccttccaa ccatttttca tttttttgaa tcttccctac tctgttatca gaggtgaaga
     2401 atttgctttg gaaataacta tattcaatta tttgaaagat gccactgagg ttaaggtaat
     2461 cattgagaaa agtgacaaat ttgatattct aatgacttca aatgaaataa atgccacagg
     2521 ccaccagcag acccttctgg ttcccagtga ggatggggca actgttcttt ttcccatcag
     2581 gccaacacat ctgggagaaa ttcctatcac agtcacagct ctttcaccca ctgcttctga
     2641 tgctgtcacc cagatgattt tagtaaaggc tgaaggaata gaaaaatcat attcacaatc
     2701 catcttatta gacttgactg acaataggct acagagtacc ctgaaaactt tgagtttctc
     2761 atttcctcct aatacagtga ctggcagtga aagagttcag atcactgcaa ttggagatgt
     2821 tcttggtcct tccatcaatg gcttagcctc attgattcgg atgccttatg gctgtggtga
     2881 acagaacatg ataaattttg ctccaaatat ttacattttg gattatctga ctaaaaagaa
     2941 acaactgaca gataatttga aagaaaaagc tctttcattt atgaggcaag gttaccagag
     3001 agaacttctc tatcagaggg aagatggctc tttcagtgct tttgggaatt atgacccttc
     3061 tgggagcact tggttgtcag cttttgtttt aagatgtttc cttgaagccg atccttacat
     3121 agatattgat cagaatgtgt tacacagaac atacacttgg cttaaaggac atcagaaatc
     3181 caacggtgaa ttttgggatc caggaagagt gattcatagt gagcttcaag gtggcaataa
     3241 aagtccagta acacttacag cctatattgt aacttctctc ctgggatata gaaagtatca
     3301 gcctaacatt gatgtgcaag agtctatcca ttttttggag tctgaattca gtagaggaat
     3361 ttcagacaat tatactctag cccttataac ttatgcattg tcatcagtgg ggagtcctaa
     3421 agcgaaggaa gctttgaata tgctgacttg gagagcagaa caagaaggtg gcatgcaatt
     3481 ctgggtgtca tcagagtcca aactttctga ctcctggcag ccacgctccc tggatattga
     3541 agttgcagcc tatgcactgc tctcacactt cttacaattt cagacttctg agggaatccc
     3601 aattatgagg tggctaagca ggcaaagaaa tagcttgggt ggttttgcat ctactcagga
     3661 taccactgtg gctttaaagg ctctgtctga atttgcagcc ctaatgaata cagaaaggac
     3721 aaatatccaa gtgaccgtga cggggcctag ctcaccaagt cctgtaaagt ttctgattga
     3781 cacacacaac cgcttactcc ttcagacagc agagcttgct gtggtacagc caatggcagt
     3841 taatatttcc gcaaatggtt ttggatttgc tatttgtcag ctcaatgttg tatataatgt
     3901 gaaggcttct gggtcttcta gaagacgaag atctatccaa aatcaagaag cctttgattt
     3961 agatgttgct gtaaaagaaa ataaagatga tctcaatcat gtggatttga atgtgtgtac
     4021 aagcttttcg ggcccgggta ggagtggcat ggctcttatg gaagttaacc tattaagtgg
     4081 ctttatggtg ccttcagaag caatttctct gagcgagaca gtgaagaaag tggaatatga
     4141 tcatggaaaa ctcaacctct atttagattc tgtaaatgaa acccagtttt gtgttaatat
     4201 tcctgctgtg agaaacttta aagtttcaaa tacccaagat gcttcagtgt ccatagtgga
     4261 ttactatgag ccaaggagac aggcggtgag aagttacaac tctgaagtga agctgtcctc
     4321 ctgtgacctt tgcagtgatg tccagggctg ccgtccttgt gaggatggag cttcaggctc
     4381 ccatcatcac tcttcagtca tttttatttt ctgtttcaag cttctgtact ttatggaact
     4441 ttggctgtga tttattttta aaggactctg tgtaacacta acatttccag tagtcacatg
     4501 tgattgtttt gttttcgtag aagaatactg cttctatttt gaaaaaagag ttttttttct
     4561 ttctatgggg ttgcagggat ggtgtacaac aggtcctagc atgtatagct gcatagattt
     4621 cttcacctga tctttgtgtg gaagatcaga atgaatgcag ttgtgtgtct atattttccc
     4681 ctcacaaaat cttttagaat ttttttggag gtgtttgttt tctccagaat aaaggtatta
     4741 ctttagaaat aggtattctc ctcattttgt gaaagaaatg aacctagatt cttaagcatt
     4801 attacacatc catgtttgct taaagatgga tttccctggg aatgggagaa aacagccagc
     4861 aggaggagct tcatctgttc ccttcccacc tccaacctag ccctactgcc caccccaccc
     4921 caacccaccc catgcccagt ggtctcagta gatacttctt aactggaaat tctttctttt
     4981 cagaatctag gtggtgaatt ttttttaagt ggcacggtct ttttctgctt gaaatctgat
     5041 cacacccccc agccattgcc ctccctctct ttttcctctg tagagaaatg tgaggggcag
     5101 tacatttact gtgcttttca caccatctca gaggttgagg agcatactga aaattgccct
     5161 ggggggtgct gggtgtgctg tctccttccc acatcctcag ccccacacca gctctatttc
     5221 aggggtgaga gtcagagagc actgcaatat gtgcttcatg ggatttcgat tcgaagatcc
     5281 tagaccaggg agacactgtg agccagggat acaacaaaat actaggtaag tcactgcaga
     5341 ccgacctccc tgcagtttgg gaaagaagct gggtttgtgg agaatcagag catcttgaca
     5401 tgactgctga cctaaagatc cctggcattg gccagggatc ctgtggaacc tcttctagtt
     5461 caggggtgtg agcattagac tgccagttgt ctagtgacat ctgatgcttg ctgtgaactt
     5521 ttaagatccc cgaatcctga gcacctcaat ctttaattgc cctgtattcc gaagggtaat
     5581 ataatttatc tggatggaaa ttttaaagat gaatccccct tttttctttt cttctctctt
     5641 ttctttcctt ctccctttct tctttgcctt ctaaatatac tgaaatgatt tagatatgtg
     5701 tcaacaatta atgatctttt attcaatcta agaaatggtt tagtttttct ctttagctct
     5761 atggcatttc actcaagtgg acaggggaaa aagtaattgc catgggctcc aaagaatttg
     5821 ctttatgttt ttagctattt aaaaataaat ccatcaaaaa taaagtatgc aaatgtatct
     5881 ttt
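
The record annotates a coding region beginning at base 113 and gives its conceptual translation. As a minimal sketch of how that translation relates to the nucleotide listing, the following translates bases 113 to 180 with the standard genetic code (an assumption; the record does not name a translation table). The excerpt is bases 61 to 180 of the ORIGIN listing, and the names `translate`, `EXCERPT_START` and `CDS_START` are hypothetical helpers introduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order
# with the third base varying fastest.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons from the start of dna, stopping at a stop codon."""
    dna = dna.lower()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


EXCERPT_START = 61  # position of the first base in the excerpt below
excerpt = (
    "tgcccgcgaacttccccggcagcggactgtagcccaggcagacgccgtcgagatgcaggg"  # bases 61-120
    "cccaccgctcctgaccgccgcccacctcctctgcgtgtgcaccgccgcgctggccgtggc"  # bases 121-180
)
CDS_START = 113  # start of the coding region as annotated in the record

coding = excerpt[CDS_START - EXCERPT_START:]
print(translate(coding))
```

The printed residues can be compared against the opening of the /translation qualifier in the record; the final two bases of the excerpt form an incomplete codon and are ignored.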
>AY149920 ACCESSION:AY149920 NID: gi 37359235 gb AY149920.1 Homo 
sapiens activated T-cell marker CD109 (CD109) mRNA, 
complete cds 
Length = 4338 

Score = 2845 bits (7294), Expect = 0.0 

Identities = 1423/1445 (98%), Positives = 1425/1445 (98%), Gaps = 17/1445 (1%) 
Frame = +1 

Query: 1 MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTIGVELLEHCPSQVTVKA 60 

MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTIGVELLEHCPSQVTVKA 
Sbjct: 1 MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTIGVELLEHCPSQVTVKA 180 

Query: 61 ELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIYELRVTGRTQDEILFSNST 120 

ELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIYELRVTGRTQDEILFSNST 
Sbjct: 181 ELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIYELRVTGRTQDEILFSNST 360 

Query: 121 RLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPYKTSLNILIKDPKSNLIQQWL 180 

RLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPYKTSLNILIKDPKSNLIQQWL 
Sbjct: 361 RLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPYKTSLNILIKDPKSNLIQQWL 540 

Query: 181 SQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQSFQVSEYVLPKFEVTLQTPLYCS 240 

SQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQSFQVSEYVLPKFEVTLQTPLYCS 
Sbjct: 541 SQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQSFQVSEYVLPKFEVTLQTPLYCS 720 

Query: 241 MNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGKKKNITKTFKINGSANFSFNDEEMK 300 

MNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGKKKNITKTFKINGSANFSFNDEEMK 
Sbjct: 721 MNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGKKKNITKTFKINGSANFSFNDEEMK 900 

Query: 301 NVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGISRNVSTNVFFKQHDYIIEFFDYTTVL 360 

NVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGISRNVSTNVFFKQHDYIIEFFDYTTVL 
Sbjct: 901 NVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGISRNVSTNVFFKQHDYIIEFFDYTTVL 1080 

Query: 361 KPSLNFTATVKVTRADGNQLTLEERRNNVVITVTQRNYTEYWSGSNSGNQKMEAVQKINY 420 

KPSLNFTATVKVTRADGNQLTLEERRNNVVITVTQRNYTEYWSGSNSGNQKMEAVQKINY 
Sbjct: 1081 KPSLNFTATVKVTRADGNQLTLEERRNNVVITVTQRNYTEYWSGSNSGNQKMEAVQKINY 1260 

Query: 421 TVPQSGTFKIEFPILEDSSELQLKAYFLGSKSSMAVHSLFKSPSKTYIQLKTRDENIKVG 480 

TVPQSGTFKIEFPILEDSSELQLKAYFLGSKSSMAVHSLFKSPSKTYIQLKTRDENIKVG 
Sbjct: 1261 TVPQSGTFKIEFPILEDSSELQLKAYFLGSKSSMAVHSLFKSPSKTYIQLKTRDENIKVG 1440 

Query: 481 SPFELVVSGNKRLKELSYMVVSRGQLVAVGKQNSTMFSLTPENSWTPKACVIVYYIEDDG 540 

SPFELVVSGNKRLKELSYMVVSRGQLVAVGKQNSTMFSLTPENSWTPKACVIVYYIEDDG 
Sbjct: 1441 SPFELVVSGNKRLKELSYMVVSRGQLVAVGKQNSTMFSLTPENSWTPKACVIVYYIEDDG 1620 

Query: 541 EIISDVLKIPVQLVFKNKIKLYWSKVKAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMN 600 

EIISDVLKIPVQLVFKNKIKLYWSKVKAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMN 
Sbjct: 1621 EIISDVLKIPVQLVFKNKIKLYWSKVKAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMN 1800 

Query: 601 ASNDITMENVVHELELYNTGYYLGMFMNSFAVFQECGLWVLTDANLTKDYIDGVYDNAEY 660 

ASNDITMENVVHELELYNTGYYLGMF+NSFAVFQECGLWVLTDANLTKDYIDGVYDNAEY 
Sbjct: 1801 ASNDITMENVVHELELYNTGYYLGMFINSFAVFQECGLWVLTDANLTKDYIDGVYDNAEY 1980 

Query: 661 AERFMEENEGHIVDIHDFSLGSSPHVRKHFPETWIWLDTNMGYRIYQEFEVTVPDSITSW 720 

AERFMEENEGHIVDIHDFSLGSSPHVRKHFPETWIWLDTNMG RIYQEFEVTVPDSITSW 
Sbjct: 1981 AERFMEENEGHIVDIHDFSLGSSPHVRKHFPETWIWLDTNMGSRIYQEFEVTVPDSITSW 2160 



Query: 721    Sbjct: 2161 
VATGFVISEDLGLGLTTTPVELQAFQPFFIFLNLPYSVIRGEEFALEITIFNYLKDATEV 

Query: 781    Sbjct: 2341 
KVIIEKSDKFDILMTSSEINAT HQQTLLVPSEDGATVLFPIRPTHLGEIPITVTALSPT 

Query: 841    Sbjct: 2521 
ASDA+TQMILVKAEGIEKSYSQSILLDLTDNRLQSTLKTLSFSFPPNTVTGSERVQITAI 

Query: 901    Sbjct: 2701 
GDVLGPSINGLASLIRMPYGCGEQNMINFAPNIYILDYLTKKKQLTDNLKEKALSFMRQG 

Query: 961    Sbjct: 2881 
YQRELLYQREDGSFSAFGNYDPSGSTWLSAFVLRCFLEADPYIDIDQNVLHRTYTWLKGH 

Query: 1021   Sbjct: 3061 
QKSNGEFWDPGRVIHSELQGGNKSPVTLTAYIVTSLLGYRKYQPNIDVQESIHFLESEFS 

Query: 1081   Sbjct: 3241 
RGISDNYTLALITYALSSVGSPKAKEALNMLTWRAEQEGGMQFWVSSESKLSDSWQPRSL 

Query: 1141   Sbjct: 3421 
DIEVAAYALLSHFLQFQTSEGIPIMRWLSRQRNSLGGFASTQDTTVALKALSEFAALMNT 

Query: 1201   Sbjct: 3601 
ERTNIQVTVTGPSSPSP LAVVQP AVNISANGFGFAICQLNVV 

Query: 1244   Sbjct: 3781 
YNVKASGSSRRRRSIQNQEAFDLDVAVKENKDDLNHVDLNVCTSFSGPGRSGMALMEVNL 

Query: 1304   Sbjct: 3961 
LSGFMVPSEAISLSETVKKVEYDHGKLNLYLDSVNETQFCVNIPAVRNFKVSNTQDASVS 

Query: 1364   Sbjct: 4141 
IVDYYEPRRQAVRSYNSEVKLSSCDLCSDVQGCRPCEDGASGSHHHSSVIFIFCFKLLYF 

Query: 1424   Sbjct: 4321 
LOCUS       AY149920     4338 bp    mRNA    linear   PRI 06-APR-2004 
DEFINITION  Homo sapiens activated T-cell marker CD109 (CD109) mRNA, complete 
            cds. 
ACCESSION   AY149920 
VERSION     AY149920.1  GI:37359235 
SOURCE      Homo sapiens (human) 
  ORGANISM  Homo sapiens 
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE   1  (bases 1 to 4338) 
  AUTHORS   Solomon,K.R., Sharma,P., Chan,M., Morrison,P.T. and Finberg,R.W. 
  TITLE     CD109 represents a novel branch of the 
            alpha2-macroglobulin/complement gene family 
  JOURNAL   Gene 327 (2), 171-183 (2004) 
   PUBMED   14980714 
REFERENCE   2  (bases 1 to 4338) 
  AUTHORS   Solomon,K., Sharma,P., Morrison,P. and Finberg,R.W. 
  TITLE     Direct Submission 
  JOURNAL   Submitted (12-SEP-2002) Medicine, UMass Medical School, 364 
            Plantation Street, Worcester, MA 01605, USA 
FEATURES             Location/Qualifiers 
     source          1..4338 
                     /organism="Homo sapiens" 
                     /mol_type="mRNA" 
                     /db_xref="taxon:9606" 
                     /chromosome="6" 
                     /map="6q" 
                     /cell_line="U373" 
     gene            1..4338 
                     /gene="CD109" 
     CDS             1..4338 
                     /gene="CD109" 
                     /note="AMC0M; alpha2-macroglobulin/complement; 
                     GPI-anchored to membrane; hematopoetic stem cell marker; 
                     inducible by PMA, LPS and phytohemaglutinin" 
                     /codon_start=1 
                     /product="activated T-cell marker CD109" 
                     /protein_id="AAN78483.1" 
                     /db_xref="GI:37359236" 
                     /translation="MQGPPLLTAAHLLCVCTAALAVAPGPRFLVTAPGIIRPGGNVTI 
                     GVELLEHCPSQVTVKAELLKTASNLTVSVLEAEGVFEKGSFKTLTLPSLPLNSADEIY 
                     ELRVTGRTQDEILFSNSTRLSFETKRISVFIQTDKALYKPKQEVKFRIVTLFSDFKPY 
                     KTSLNILIKDPKSNLIQQWLSQQSDLGVISKTFQLSSHPILGDWSIQVQVNDQTYYQS 
                     FQVSEYVLPKFEVTLQTPLYCSMNSKHLNGTITAKYTYGKPVKGDVTLTFLPLSFWGK 
                     KKNITKTFKINGSANFSFNDEEMKNVMDSSNGLSEYLDLSSPGPVEILTTVTESVTGI 
                     SRNVSTNVFFKQHDYIIEFFDYTTVLKPSLNFTATVKVTRADGNQLTLEERRNNVVIT 
                     VTQRNYTEYWSGSNSGNQKMEAVQKINYTVPQSGTFKIEFPILEDSSELQLKAYFLGS 
                     KSSMAVHSLFKSPSKTYIQLKTRDENIKVGSPFELVVSGNKRLKELSYMVVSRGQLVA 
                     VGKQNSTMFSLTPENSWTPKACVIVYYIEDDGEIISDVLKIPVQLVFKNKIKLYWSKV 
                     KAEPSEKVSLRISVTQPDSIVGIVAVDKSVNLMNASNDITMENVVHELELYNTGYYLG 
                     MFINSFAVFQECGLWVLTDANLTKDYIDGVYDNAEYAERFMEENEGHIVDIHDFSLGS 
                     SPHVRKHFPETWIWLDTNMGSRIYQEFEVTVPDSITSWVATGFVISEDLGLGLTTTPV 
                     ELQAFQPFFIFLNLPYSVIRGEEFALEITIFNYLKDATEVKVIIEKSDKFDILMTSSE 
                     INATSHQQTLLVPSEDGATVLFPIRPTHLGEIPITVTALSPTASDAITQMILVKAEGI 
                     EKSYSQSILLDLTDNRLQSTLKTLSFSFPPNTVTGSERVQITAIGDVLGPSINGLASL 
                     IRMPYGCGEQNMINFAPNIYILDYLTKKKQLTDNLKEKALSFMRQGYQRELLYQREDG 
                     SFSAFGNYDPSGSTWLSAFVLRCFLEADPYIDIDQNVLHRTYTWLKGHQKSNGEFWDP 
                     GRVIHSELQGGNKSPVTLTAYIVTSLLGYRKYQPNIDVQESIHFLESEFSRGISDNYT 
                     LALITYALSSVGSPKAKEALNMLTWRAEQEGGMQFWVSSESKLSDSWQPRSLDIEVAA 
                     YALLSHFLQFQTSEGIPIMRWLSRQRNSLGGFASTQDTTVALKALSEFAALMNTERTN 
                     IQVTVTGPSSPSPVKFLIDTHNRLLLQTAELAVVQPTAVNISANGFGFAICQLNVVYN 
                     VKASGSSRRRRSIQNQEAFDLDVAVKENKDDLNHVDLNVCTSFSGPGRSGMALMEVNL 
                     LSGFMVPSEAISLSETVKKVEYDHGKLNLYLDSVNETQFCVNIPAVRNFKVSNTQDAS 
                     VSIVDYYEPRRQAVRSYNSEVKLSSCDLCSDVQGCRPCEDGASGSHHHSSVIFIFCFK 
                     LLYFMELWL" 



ORIGIN 
        1 atgcagggcc caccgctcct gaccgccgcc cacctcctct gcgtgtgcac cgccgcgctg
       61 gccgtggctc ccgggcctcg gtttctggtg acagccccag ggatcatcag gcccggagga
      121 aatgtgacta ttggggtgga gcttctggaa cactgccctt cacaggtgac tgtgaaggcg
      181 gagctgctca agacagcatc aaacctcact gtctctgtcc tggaagcaga aggagtcttt
      241 gaaaaaggct cttttaagac acttactctt ccatcactac ctctgaacag tgcagatgag
      301 atttatgagc tacgtgtaac cggacgtacc caggatgaga ttttattctc taatagtacc
      361 cgcttatcat ttgagaccaa gagaatatct gtcttcattc aaacagacaa ggccttatac
      421 aagccaaagc aagaagtgaa gtttcgcatt gttacactct tctcagattt taagccttac
      481 aaaacctctt taaacattct cattaaggac cccaaatcaa atttgatcca acagtggttg
      541 tcacaacaaa gtgatcttgg agtcatttcc aaaacttttc agctatcttc ccatccaata
      601 cttggtgact ggtctattca agttcaagtg aatgaccaga catattatca atcatttcag
      661 gtttcagaat atgtattacc aaaatttgaa gtgactttgc agacaccatt atattgttct
      721 atgaattcta agcatttaaa tggtaccatc acggcaaagt atacatatgg gaagccagtg
      781 aaaggagacg taacgcttac atttttacct ttatcctttt ggggaaagaa gaaaaatatt
      841 acaaaaacat ttaagataaa tggatctgca aacttctctt ttaatgatga agagatgaaa
      901 aatgtaatgg attcttcaaa tggactttct gaatacctgg atctatcttc ccctggacca
      961 gtagaaattt taaccacagt gacagaatca gttacaggta tttcaagaaa tgtaagcact
     1021 aatgtgttct tcaagcaaca tgattacatc attgagtttt ttgattatac tactgtcttg
     1081 aagccatctc tcaacttcac agccactgtg aaggtaactc gtgctgatgg caaccaactg
     1141 actcttgaag aaagaagaaa taatgtagtc ataacagtga cacagagaaa ctatactgag
     1201 tactggagcg gatctaacag tggaaatcag aaaatggaag ctgttcagaa aataaattat
     1261 actgtccccc aaagtggaac ttttaagatt gaattcccaa tcctggagga ttccagtgag
     1321 ctacagttga aggcctattt ccttggtagt aaaagtagca tggcagttca tagtctgttt
     1381 aagtctccta gtaagacata catccaacta aaaacaagag atgaaaatat aaaggtggga
     1441 tcgccttttg agttggtggt tagtggcaac aaacgattga aggagttaag ctatatggta
     1501 gtatccaggg gacagttggt ggctgtagga aaacaaaatt caacaatgtt ctctttaaca
     1561 ccagaaaatt cttggactcc aaaagcctgt gtaattgtgt attatattga agatgatggg
     1621 gaaattataa gtgatgttct aaaaattcct gttcagcttg tttttaaaaa taagataaag
     1681 ctatattgga gtaaagtgaa agctgaacca tctgagaaag tctctcttag gatctctgtg
     1741 acacagcctg actccatagt tgggattgta gctgttgaca aaagtgtgaa tctgatgaat
     1801 gcctctaatg atattacaat ggaaaatgtg gtccatgagt tggaacttta taacacagga
     1861 tattatttag gcatgttcat aaattctttt gcagtctttc aggaatgtgg actctgggta
     1921 ttgacagatg caaacctcac gaaggattat attgatggtg tttatgacaa tgcagaatat
     1981 gctgagaggt ttatggagga aaatgaagga catattgtag atattcatga cttttctttg
     2041 ggtagcagtc cacatgtccg aaagcatttt ccagagactt ggatttggct agacaccaac
     2101 atgggttcca ggatttacca agaatttgaa gtaactgtac ctgattctat cacttcttgg
     2161 gtggctactg gttttgtgat ctctgaggac ctgggtcttg gactaacaac tactccagtg
     2221 gagctccaag ccttccaacc atttttcatt tttttgaatc ttccctactc tgttatcaga
     2281 ggtgaagaat ttgctttgga aataactata ttcaattatt tgaaagatgc cactgaggtt
     2341 aaggtaatca ttgagaaaag tgacaaattt gatattctaa tgacttcaag tgaaataaat
     2401 gccacaagcc accagcagac ccttctggtt cccagtgagg atggggcaac tgttcttttt
     2461 cccatcaggc caacacatct gggagaaatt cctatcacag tcacagctct ttcacccact
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2521 
2581 
2641 
2701 
2761 
2821 
2881 
2941 
3001 
3061 
3121 
3181 
3241 
3301 
3361 
3421 
3481 
3541 
3601 
3661 
3721 
3781 
3841 
3901 
3961 
4021 
4081 
4141 
4201 
4261 
4321 



gcttctgatg 
tcacaatcca 
agtttctcat 
ggagatgttc 
tgtggtgaac 
aaaaagaaac 
taccagagag 
gacccttctg 
ccttacatag 
cagaaatcca 
ggcaataaaa 
aagtatcagc 
agaggaattt 
agtcctaaag 
atgcaattct 
gatattgaag 
ggaatcccaa 
actcaggata 
gaaaggacaa 
ctgattgaca 
acggcagtta 
tataatgtga 
tttgatttag 
gtgtgtacaa 
ttaagtggct 
gaatatgatc 
gttaatattc 
atagtggatt 
ctgtcctcct 
tcaggctccc 
atggaacttt 



ctatcaccca 
tcttattaga 
ttcctcctaa 
ttggtccttc 
agaacatgat 
aactgacaga 
aacttctcta 
ggagcacttg 
atattgatca 
acggtgaatt 
gtccagtaac 
ctaacattga 
cagacaatta 
cgaaggaagc 
gggtgtcatc 
ttgcagccta 
ttatgaggtg 
ccactgtggc 
atatccaagt 
cacacaaccg 
atatttccgc 
aggcttctgg 
atgttgctgt 
gcttttcggg 
ttatggtgcc 
atggaaaact 
ctgctgtgag 
actatgagcc 
gtgacctttg 
atcatcactc 
ggctgtga 



gatgatttta 
cttgactgac 
tacagtgact 
catcaatggc 
aaattttgct 
taatttgaaa 
tcagagggaa 
gttgtcagct 
gaatgtgtta 
ttgggatcca 
acttacagcc 
tgtgcaagag 
tactctagcc 
tttgaatatg 
agagtccaaa 
tgcactgctc 
gctaagcagg 
tttaaaggct 
gaccgtgacg 
cttactcctt 
aaatggtttt 
gtcttctaga 
aaaagaaaat 
cccgggtagg 
ttcagaagca 
caacctctat 
aaactttaaa 
aaggagacag 
cagtgatgtc 
ttcagtcatt 



gtaaaggctg 
aataggctac 
ggcagtgaaa 
ttagcctcat 
ccaaatattt 
gaaaaagctc 
gatggctctt 
tttgttttaa 
cacagaacat 
ggaagagtga 
tatattgtaa 
tctatccatt 
cttataactt 
ctgacttgga 
ctttctgact 
tcacacttct 
caaagaaata 
ctgtctgaat 
gggcctagct 
cagacagcag 
ggatttgcta 
agacgaagat 
aaagatgatc 
agtggcatgg 
atttctctga 
ttagattctg 
gtttcaaata 
gcggtgagaa 
cagggctgcc 
tttattttct 



aaggaataga 
agagtaccct 
gagttcagat 
tgattcggat 
acattttgga 
tttcatttat 
tcagtgcttt 
gatgtttcct 
acacttggct 
ttcatagtga 
cttctctcct 
ttttggagtc 
atgcattgtc 
gagcagaaca 
cctggcagcc 
tacaatttca 
gcttgggtgg 
ttgcagccct 
caccaagtcc 
agcttgctgt 
tttgtcagct 
ctatccaaaa 
tcaatcatgt 
ctcttatgga 
gcgagacagt 
taaatgaaac 
cccaagatgc 
gttacaactc 
gtccttgtga 
gtttcaagct 



aaaatcatat 
gaaaactttg 
cactgcaatt 
gccttatggc 
ttatctgact 
gaggcaaggt 
tgggaattat 
tgaagccgat 
taaaggacat 
gcttcaaggt 
gggatataga 
tgaattcagt 
atcagtgggg 
agaaggtggc 
acgctccctg 
gacttctgag 
ttttgcatct 
aatgaataca 
tgtaaagttt 
ggtacagcca 
caatgttgta 
tcaagaagcc 
ggatttgaat 
agttaaccta 
gaagaaagtg 
ccagttttgt 
ttcagtgtct 
tgaagtgaag 
ggatggagct 
tctgtacttt 
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Cell surface antigen CD109 is a novel member of the alpha(2) 
macroglobulin/C3, C4, C5 family of thioester-containing proteins.

Lin M, Sutherland DR, Horsfall W, Totty N, Yeo E, Nayar R, Wu XF, 
Schuh AC. 

Department of Medical Biophysics, University of Toronto, ON, Canada. 

Cell surface antigen CD109 is a glycosylphosphatidylinositol (GPI)-linked
glycoprotein of approximately 170 kd found on a subset of hematopoietic stem
and progenitor cells and on activated platelets and T cells. Although it has been
suggested that T-cell CD109 may play a role in antibody-inducing T-helper
function and it is known that platelet CD109 carries the Gov alloantigen system,
the role of CD109 in hematopoietic cells remains largely unknown. As a first
step toward elucidating the function of CD109, we have isolated and
characterized a human CD109 cDNA from KG1a and endothelial cells. The
isolated cDNA comprises a 4335 bp open-reading frame encoding a 1445 amino
acid (aa) protein of approximately 162 kd that contains a 21 aa N-terminal
leader peptide, 17 potential N-linked glycosylation sites, and a C-terminal GPI
anchor cleavage-addition site. We report that CD109 is a novel member of the
alpha 2 macroglobulin (alpha 2M)/C3, C4, C5 family of thioester-containing
proteins, and we demonstrate that native CD109 does indeed contain an intact
thioester. Analysis of the CD109 aa sequence suggests that CD109 is likely
activated by proteolytic cleavage and thereby becomes capable of
thioester-mediated covalent binding to adjacent molecules or cells. In addition,
the predicted chemical reactivity of the activated CD109 thioester is
complement-like rather than resembling that of alpha 2M proteins. Thus, not only
is CD109 potentially capable of covalent binding to carbohydrate and protein
targets, but the t(1/2) of its activated thioester is likely extremely short,
indicating that CD109 action is highly restricted spatially to the site of its
activation.
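
The relationship between the reported open-reading frame length and protein length can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 110 Da average residue mass is a common rule of thumb that does not come from the abstract, so the mass estimate is approximate and ignores glycosylation and the GPI anchor.

```python
# Illustrative check of the figures quoted in the abstract above.
# Assumption: 110 Da is a rule-of-thumb average residue mass, not a value
# taken from the source.
AVG_RESIDUE_MASS_DA = 110.0


def codons_in_orf(orf_bp: int) -> int:
    """Return the number of complete codons in an ORF of the given length."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    return orf_bp // 3


def approx_mass_kd(n_residues: int, avg_mass_da: float = AVG_RESIDUE_MASS_DA) -> float:
    """Rough polypeptide mass in kilodaltons from the residue count."""
    return n_residues * avg_mass_da / 1000.0


orf_bp = 4335                      # ORF length reported in the abstract
n_codons = codons_in_orf(orf_bp)   # compare with the reported 1445 aa
mass_kd = approx_mass_kd(n_codons) # compare with the reported ~162 kd
print(n_codons, round(mass_kd, 1))
```

The codon count matches the reported 1445 residues, and the rough mass estimate lands close to the reported value of approximately 162 kd.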
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• DNA, Complementary/analysis 

• DNA, Complementary/genetics 

• DNA, Complementary/isolation & purification 

• Glutamine 

• Glycosylphosphatidylinositols/chemistry 

• Hematopoietic Stem Cells/chemistry 

• Hematopoietic Stem Cells/immunology 

• Human 

• Molecular Sequence Data 

• Sequence Alignment 

• Sequence Analysis, DNA 
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• Support, Non-U.S. Gov't 

• Tumor Cells, Cultured 

• Variation (Genetics) 

• alpha-Macroglobulins/chemistry* 
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• alpha-Macroglobulins/metabolism 

Substances: 

• Antigens, CD 

• CD109 antigen, human

• DNA, Complementary 

• Glycosylphosphatidylinositols 

• Sulfides 

• alpha-Macroglobulins 

• Cysteine 
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CD109 represents a novel branch of the alpha2- 
macroglobulin/complement gene family. 

Solomon KR, Sharma P, Chan M, Morrison PT, Finberg RW. 

Department of Orthopaedic Surgery, Children's Hospital, Boston, MA 02115, 
USA. 

We report here the genomic organization and phylogenic relationships of
CD109, a member of the alpha2-macroglobulin/complement (AMCOM)
gene family. CD109 is a GPI-linked glycoprotein expressed on endothelial cells,
platelets, activated T-cells, and a wide variety of tumors. We cloned full-length
CD109 cDNA from the mammalian U373 cell line by RT-PCR and performed
analysis of its corresponding genomic sequence. The CD109 cDNA spans 128
kb of chromosome 6q with its 33 exons constituting approximately 3.3% of the
total CD109 genomic sequence. Sequence analysis revealed that CD109
contains specific motifs in its N-terminus that are highly conserved in all
AMCOM members. CD109 also shares motifs with certain other AMCOM
members including: (1) a thioester 'GCGEQ' motif, (2) a furin site of four
positively charged amino acids, and (3) a double tyrosine near the C-terminus.
Based on a phylogenic analysis of human CD109 with other human homologs
as well as orthologs from other mammalian species, C. elegans (ZK337.1) and
E. coli homologs, we propose CD109 represents a novel and independent
branch of the alpha2-macroglobulin/complement gene family (AMCOM) and
may be its oldest member.
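
As a rough consistency check on the genomic figures quoted above, the exonic share of the locus can be converted into base pairs. This is a minimal sketch; the variable names are hypothetical and the inputs are the rounded values from the abstract, so the results are approximate.

```python
# Rounded values taken from the abstract above; variable names are illustrative.
GENE_SPAN_BP = 128_000   # "spans 128 kb of chromosome 6q"
EXON_FRACTION = 0.033    # "approximately 3.3%" of the genomic sequence
N_EXONS = 33

exonic_bp = GENE_SPAN_BP * EXON_FRACTION   # total exonic sequence, in bp
mean_exon_bp = exonic_bp / N_EXONS         # average exon length, in bp

print(f"exonic sequence: about {exonic_bp:.0f} bp")
print(f"mean exon length: about {mean_exon_bp:.0f} bp")
```

The exonic total of roughly 4.2 kb is of the same order as a full-length cDNA of several kilobases, which is what the 3.3% figure would lead one to expect.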

MeSH Terms: 
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• Antigens, CD/chemistry 

• Antigens, CD/genetics* 

• Antigens, CD/metabolism 
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• Human 
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• Molecular Sequence Data 

• Multigene Family/genetics 
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• Sequence Alignment 

• Sequence Analysis, DNA 

• Sequence Analysis, Protein 

• Sequence Homology, Amino Acid 

• Support, Non-U.S. Gov't 

• Support, U.S. Gov't, P.H.S.

• alpha-Macroglobulins/genetics* 

Substances: 

• Antigens, CD 

• CD109 antigen, human 

• DNA, Complementary 

• alpha-Macroglobulins 

• Complement 

• Phosphatidylinositol Diacylglycerol-Lyase 

Secondary Source ID: 

• PIR/AY149920 

Grant Support: 

• R01 GM63244/GM/NIGMS

PMID: 14980714 [PubMed - indexed for MEDLINE] 
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ABH antigens on human platelets: expression on the glycosyl 
phosphatidylinositol-anchored protein CD109. 

Kelton JG, Smith JW, Horsewood P, Warner MN, Warkentin TE, Finberg
RW, Hayward CP.

Department of Medicine, McMaster University, Hamilton, Ontario, Canada.

Platelets express alloantigens that are platelet specific (eg, the HPA antigens)
and alloantigens that are shared with other blood cells (eg, the ABH antigens).
The blood group A and B determinants are expressed on glycolipids and on
some intrinsic platelet membrane glycoproteins. This report characterizes
multiple platelet proteins reacting with blood group antibodies in serum samples
from mothers of children born with neonatal alloimmune thrombocytopenia.
ABH antigens on additional platelet proteins are identified, including the
glycosyl phosphatidylinositol-anchored protein CD109. The proteins that carry
ABH antigens were identified by using monoclonal antibodies to glycoproteins
Ib, IIb/IIIa, Ia/IIa, CD31, and CD109 and immunoprecipitation/immunoblotting
techniques with monoclonal antibodies to A and B antigens. The maternal
serum samples and anti-A and anti-B monoclonal antibodies
immunoprecipitated identical radiolabeled platelet proteins including proteins at
220 and 175 kd and proteins with mobilities corresponding to glycoproteins Ib,
IIb/IIIa, IV, and V. Treatment of platelets with phosphatidylinositol-specific
phospholipase C released into the supernatant a 175-kd protein that expressed
the blood group determinants. This protein comigrated with the glycosyl
phosphatidylinositol-anchored protein CD109. When platelet proteins were
purified by immunoprecipitation with monoclonal antibodies and then tested by
immunoblotting, anti-A reacted with the glycosyl phosphatidylinositol-anchored
protein CD109 and with glycoproteins Ib, IIb, IIa, IIIa, and CD31 (PECAM).
These results indicate that structures for modification by glycosyltransferases
exist on platelet CD109, which also expresses the Gov alloantigen system. This
study indicates that certain platelet proteins express both platelet-specific and
blood group antigens that may contribute to platelet transfusion refractoriness
and to neonatal alloimmune thrombocytopenia.
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Detection of Gov system antibodies by MAIPA reveals an 
immunogenicity similar to the HPA-5 alloantigens. 

Berry JE, Murphy CM, Smith GA, Ranasinghe E, Finberg R, Walton J, 
Brown J, Navarrete C, Metcalfe P, Ouwehand WH. 

Division of Haematology, National Institute for Biological Standards and 
Control, Potters Bar, UK. 

The glycosylphosphatidylinositol-linked platelet protein CD109 carries the
biallelic alloantigen system Gov. There is limited information on the incidence
of Gov alloantibodies in neonatal alloimmune thrombocytopenia (NAITP),
post-transfusion purpura (PTP) and platelet refractoriness. We adapted the
monoclonal antibody-specific immobilization of platelet antigens (MAIPA)
assay to the detection of Gov antibodies and determined their incidence in 605
archived samples (112 with HPA antibodies) referred for the aforementioned
conditions. Here, we show that CD109 expression was reduced upon platelet
storage in saline or by cryopreservation, but was stable when stored as whole
blood or therapeutic platelet concentrate. Fourteen of the 605 samples contained
Gov alloantibodies (anti-Gova, n = 10; anti-Govb, n = 4), with the majority in
platelet refractoriness (n = 9) and, of the remaining five, four in NAITP and one
in PTP. In seven cases, no other HPA antibodies were detected, three being
NAITP cases. The incidence of Gov antibodies was significantly lower than
HPA-1 system antibodies (n = 87), but equalled the number of HPA-5 system
antibodies (n = 14) and outnumbered HPA-2 and -3 system antibodies (10
altogether).
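
The counts reported in this abstract can be cross-tabulated to confirm that the two breakdowns agree and to express the incidence as a percentage. This is only an illustrative tally; the dictionary keys and variable names are hypothetical labels for the figures quoted above.

```python
# Figures quoted in the abstract above; names are illustrative labels.
TOTAL_SAMPLES = 605

by_specificity = {"anti-Gova": 10, "anti-Govb": 4}
by_condition = {"platelet refractoriness": 9, "NAITP": 4, "PTP": 1}

gov_positive = sum(by_specificity.values())

# Both breakdowns should describe the same set of Gov-positive samples.
if gov_positive != sum(by_condition.values()):
    raise ValueError("specificity and condition tallies disagree")

incidence_pct = 100.0 * gov_positive / TOTAL_SAMPLES
print(f"{gov_positive} of {TOTAL_SAMPLES} samples ({incidence_pct:.1f}%)")
```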

PMID: 10997989 [PubMed - indexed for MEDLINE] 
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THE HUMAN GENOME 

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of
the human genome was generated by the whole-genome shotgun sequencing
method. The 14.8-billion bp DNA sequence was generated over 9 months from
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome)
from both ends of plasmid clones made from the DNA of five individuals. Two
assembly strategies, a whole-genome assembly and a regional chromosome
assembly, were used, each combining sequence data from Celera and the
publicly funded genome effort. The public data were shredded into 550-bp
segments to create a 2.9-fold coverage of those genome regions that had been
sequenced, without including biases inherent in the cloning and assembly
procedure used by the publicly funded group. This brought the effective
coverage in the assemblies to eightfold, reducing the number and size of gaps in
the final assembly over what would be obtained with 5.11-fold coverage. The
two assembly strategies yielded very similar results that largely agree with
independent mapping data. The assemblies effectively cover the euchromatic
regions of the human chromosomes. More than 90% of the genome is in
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed
26,588 protein-encoding transcripts for which there was strong corroborating
evidence and an additional ~12,000 computationally derived genes with mouse
matches or other weak supporting evidence. Although gene-dense clusters are
obvious, almost half the genes are dispersed in low G+C sequence separated
by large tracts of apparently noncoding sequence. Only 1.1% of the genome
is spanned by exons, whereas 24% is in introns, with 75% of the genome being
intergenic DNA. Duplications of segmental blocks, ranging in size up to
chromosomal lengths, are abundant throughout the genome and reveal a complex
evolutionary history. Comparative genomic analysis indicates vertebrate
expansions of genes associated with neuronal function, with tissue-specific
developmental regulation, and with the hemostasis and immune systems. DNA
sequence comparisons between the consensus sequence and publicly funded
genome data provided locations of 2.1 million single-nucleotide polymorphisms
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per
1250 on average, but there was marked heterogeneity in the level of
polymorphism across the genome. Less than 1% of all SNPs resulted in variation in
proteins, but the task of determining which SNPs have functional consequences
remains an open challenge.
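
The percentages and rates in the summary above translate into absolute sequence spans as follows. This is a back-of-the-envelope sketch: the variable names are hypothetical, the inputs are the rounded figures from the summary, and the category fractions sum to slightly more than 100% because of that rounding.

```python
# Rounded figures from the summary above; names are illustrative.
GENOME_BP = 2.91e9   # consensus sequence length, in bp

fractions = {"exons": 0.011, "introns": 0.24, "intergenic": 0.75}

# Approximate span of each category, in base pairs.
span_bp = {name: GENOME_BP * frac for name, frac in fractions.items()}

# Expected number of differences between a random pair of haploid genomes
# at the stated average rate of 1 bp per 1250.
pairwise_differences = GENOME_BP / 1250

for name, bp in span_bp.items():
    print(f"{name:<10} ~{bp / 1e6:,.0f} Mbp")
print(f"pairwise differences ~{pairwise_differences / 1e6:.2f} million")
```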



Decoding of the DNA that constitutes the human genome has been widely
anticipated for the contribution it will make toward understanding human
evolution, the causation of disease, and the interplay between the
environment and heredity in defining the human condition. A project with
the goal of determining the complete nucleotide sequence of the human
genome was first formally proposed in 1985 (1). In subsequent years, the
idea met with mixed reactions in the scientific community (2). However, in
1990, the Human Genome Project (HGP) was officially initiated in the United
States under the direction of the National Institutes of Health and the
U.S. Department of Energy with a 15-year, $3 billion plan for completing
the genome sequence. In 1998 we announced our intention to build a unique
genome-sequencing facility, to determine the sequence of the human genome
over a 3-year period. Here we report the penultimate milestone along the
path toward that goal, a nearly complete sequence of the euchromatic
portion of the human genome. The sequencing was performed by a
whole-genome random shotgun method with subsequent assembly of the
sequenced segments.

The modern history of DNA sequencing began in 1977, when Sanger reported
his method for determining the order of nucleotides of DNA using
chain-terminating nucleotide analogs (3). In the same year, the first human
gene was isolated and sequenced (4). In 1986, Hood and co-workers (5)
described an improvement in the Sanger sequencing method that included
attaching fluorescent dyes to the nucleotides, which permitted them to be
sequentially read by a computer. The first automated DNA sequencer,
developed by Applied Biosystems in California in 1987, was shown to be
successful when the sequences of two genes were obtained with this new
technology (6). From early sequencing of human genomic regions (7), it
became clear that cDNA sequences (which are reverse-transcribed from RNA)
were essential to annotate and validate gene predictions in the human
genome. These studies were the basis in part for the development of the
expressed sequence tag (EST) method of gene identification (8), which is a
random selection, very high throughput sequencing approach to characterize
cDNA libraries. The EST method led to the rapid discovery and mapping of
human genes (9). The increasing numbers of human EST sequences necessitated
the development of new computer algorithms to analyze large amounts of
sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an
algorithm was developed that permitted assembly and analysis of hundreds of
thousands of ESTs. This algorithm permitted characterization and annotation
of human genes on the basis of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda genome sequence was determined by
a shotgun restriction digest method in 1982 (11). When considering methods
for sequencing the smallpox virus genome in 1991 (12), a whole-genome
shotgun sequencing method was discussed and subsequently rejected owing to
the lack of appropriate software tools for genome assembly. However, in
1994, when a microbial genome-sequencing project was contemplated at TIGR,
a whole-genome shotgun sequencing approach was considered possible with the
TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae
genome was completed by a whole-genome shotgun sequencing method (13). The
experience with several subsequent genome-sequencing efforts established
the broad applicability of this approach (14, 15).

A key feature of the sequencing approach used for these megabase-size and
larger genomes was the use of paired-end sequences (also called mate
pairs), derived from subclone libraries with distinct insert sizes and
cloning characteristics. Paired-end sequences are sequences 500 to 600 bp
in length from both ends of double-stranded DNA clones of prescribed
lengths. The success of using end sequences from long segments (18 to 20
kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial
genomes led to the suggestion (16) of an approach to simultaneously map and
sequence the human genome by means of end sequences from 150-kbp bacterial
artificial chromosomes (BACs) (17, 18). The end sequences spanned by known
distances provide long-range continuity across the genome. A modification
of the BAC end-sequencing (BES) method was applied successfully to complete
chromosome 2 from the Arabidopsis thaliana genome (19).
In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of
the human genome. Their proposal was not well received (21). However, by
early 1998, as less than 5% of the genome had been sequenced, it was clear
that the rate of progress in human genome sequencing worldwide was very
slow (22), and the prospects for finishing the genome by the 2005 goal were
uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an
automated, high-throughput capillary DNA sequencer, subsequently called the
ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR
scientists resulted in a plan to undertake the sequencing of the human
genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing
techniques developed at TIGR (23). Many of the principles of operation of a
genome-sequencing facility were established in the TIGR facility (24).
However, the facility envisioned for Celera would have a capacity roughly
50 times that of TIGR, and thus new developments were required for sample
preparation and tracking and for whole-genome assembly. Some argued that
the required 150-fold scale-up from the H. influenzae genome to the human
genome with its complex repeat sequences was not feasible (25). The
Drosophila melanogaster genome was thus chosen as a test case for
whole-genome assembly on a large and complex eukaryotic genome. In
collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project,
the nucleotide sequence of the 120-Mbp euchromatic portion of the
Drosophila genome was determined over a 1-year period (26-28). The
Drosophila genome-sequencing effort resulted in two key findings: (i) that
the assembly algorithms could generate chromosome assemblies with highly
accurate order and orientation with substantially less than 10-fold
coverage, and (ii) that undertaking multiple interim assemblies in place of
one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome
effort subsequent to the formation of Celera (29), led to a modified
whole-genome shotgun sequencing approach to the human genome. We initially
proposed to do 10-fold sequence coverage of the genome over a 3-year period
and to make interim assembled sequence data available quarterly. The
modifications included a plan to perform random shotgun sequencing to
~5-fold coverage and to use the unordered and unoriented BAC sequence
fragments and subassemblies published in GenBank by the publicly funded
genome effort (30) to accelerate the project. We also abandoned the
quarterly announcements in the absence of interim assemblies to report.

Although this strategy provided a reasonable result very early that was
consistent with a whole-genome shotgun assembly with eight-fold coverage,
the human genome sequence is not as finished as the Drosophila genome was
with an effective 13-fold coverage. However, it became clear that even with
this reduced coverage strategy, Celera could generate an accurately ordered
and oriented scaffold sequence of the human genome in less than 1 year.
Human genome sequencing was initiated 8 September 1999 and completed 17
June 2000. The first assembly was completed 25 June 2000, and the assembly
reported here was completed 1 October 2000. Here we describe the
whole-genome random shotgun sequencing effort applied to the human genome.
We developed two different assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of chromosomes of the Homo sapiens
genome. Any GenBank-derived data were shredded to remove potential bias to
the final sequence from chimeric clones, foreign DNA contamination, or
misassembled contigs. Insofar as a correctly and accurately assembled
genome sequence with faithful order and orientation of contigs is essential
for an accurate analysis of the human genetic code, we have devoted a
considerable portion of this manuscript to the documentation of the quality
of our reconstruction of the genome. We also describe our preliminary
analysis of the human genetic code on the basis of computational methods.
Figure 1 (see fold-out chart associated with this issue; files for each
chromosome can be found in Web fig. 1 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical
overview of the genome and the features encoded in it. The detailed manual
curation and interpretation of the genome are just beginning.

To aid the reader in locating specific analytical sections, we have divided
the paper into seven broad sections. A summary of the major results appears
at the beginning of each section.

1 Sources of DNA and Sequencing Methods

2 Genome Assembly Strategy and Characterization

3 Gene Prediction and Annotation

4 Genome Structure

5 Genome Evolution

6 A Genome-Wide Examination of Sequence Variations

7 An Overview of the Predicted Protein-Coding Genes in the Human Genome

8 Conclusions



1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules governing
donor selection to ensure ethnic and gender diversity along with the
methodologies for DNA extraction and library construction. The plasmid
library construction is the first critical step in shotgun sequencing. If
the DNA libraries are not uniform in size, nonchimeric, and do not randomly
represent the genome, then the subsequent steps cannot accurately
reconstruct the genome sequence. We used automated high-throughput DNA
sequencing and the computational infrastructure to enable efficient
tracking of enormous amounts of sequence information (27.3 million sequence
reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends
of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the
computational reconstruction of the genome. Our evidence indicates that the
accurate pairing rate of end sequences was greater than 98%.

Various policies of the United States and the World Medical Association,
specifically the Declaration of Helsinki, offer recommendations for
conducting experiments with human subjects. We convened an Institutional
Review Board (IRB) (31) that helped us establish the protocol for obtaining
and using human DNA and the informed consent process used to enroll
research volunteers for the DNA-sequencing studies reported here. We
adopted several steps and procedures to protect the privacy rights and
confidentiality of the research subjects (donors). These included a
two-stage consent process, a secure random alphanumeric coding system for
specimens and records, circumscribed contact with the subjects by
researchers, and options for off-site contact of donors. In addition,
Celera applied for and received a Certificate of Confidentiality from the
Department of Health and Human Services. This Certificate authorized Celera
to protect the privacy of the individuals who volunteered to be donors as
provided in Section 301(d) of the Public Health Service Act 42 U.S.C.
241(d).

Celera and the IRB believed that the initial version of a completed human
genome should be a composite derived from multiple donors of diverse ethnic
backgrounds. Prospective donors were asked, on a voluntary basis, to
self-designate an ethnogeographic category (e.g., African-American,
Chinese, Hispanic, Caucasian, etc.). We enrolled

21 donors (32).

Three basic items of information from each donor were recorded and linked
by confidential code to the donated sample: age, sex, and self-designated
ethnogeographic group. From females, ~130 ml of whole, heparinized blood
was collected. From males, ~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen, collected over a 6-week
period. Permanent lymphoblastoid cell lines were created by Epstein-Barr
virus immortalization. DNA from five subjects was selected for genomic DNA
sequencing: two males and three females (one African-American, one
Asian-Chinese, one Hispanic-Mexican, and two Caucasians; see Web fig. 2 on
Science Online at www.sciencemag.org/cgi/content/291/5507/1304/DC1). The
decision of whose DNA to sequence was based on a complex mix of factors,
including the goal of achieving diversity as well as technical issues such
as the quality of the DNA libraries and availability of immortalized cell
lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of
high-quality plasmid libraries in a variety of insert sizes so that pairs
of sequence reads (mates) are obtained, one read from both ends of each
plasmid insert. High-quality libraries have an equal representation of all
parts of the genome, a small number of clones without inserts, and no
contamination from such sources as the mitochondrial genome and Escherichia
coli genomic DNA. DNA from each donor was used to construct plasmid
libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp
(Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple
system that could be implemented in a robust and reproducible manner and
monitored effectively (Fig. 2) (34).

Current sequencing protocols are based on the dideoxy sequencing method
(35), which typically yields only 500 to 750 bp of sequence per reaction.
This limitation on read length has made monumental gains in throughput a
prerequisite for the analysis of large eukaryotic genomes. We accomplished
this at the Celera facility, which occupies about 30,000 square feet of
laboratory space and produces sequence data continuously at a rate of
175,000 total reads per day. The DNA-sequencing facility is supported by a
high-performance computational facility (36).

The process for DNA sequencing was modular by design and automated.
Intermodule sample backlogs allowed four principal modules to operate
independently: (i) library transformation, plating, and colony picking;
(ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and
purification; and (iv) sequence determination with the ABI PRISM 3700 DNA
Analyzer. Because the inputs and outputs of each module have been carefully
matched and sample backlogs are continuously managed, sequencing has
proceeded without a single day's interruption since the initiation of the
Drosophila project in May 1999. The ABI 3700 is a fully automated capillary
array sequencer and as such can be operated with a minimal amount of
hands-on time, currently estimated at about 15 min per day. The capillary
system also facilitates correct associations of sequencing traces with
samples through the elimination of manual sample loading and lane-tracking
errors associated with slab gels. About 65 production staff were hired and
trained, and were rotated on a regular basis through the four production
modules. A central laboratory information management system (LIMS) tracked
all sample plates by unique bar code identifiers. The facility was
supported by a quality control team that performed raw material and
in-process testing and a quality assurance group with responsibilities
including document control, validation, and auditing of the facility.
Critical to the success of the scale-up was the validation of all software
and instrumentation before implementation, and production-scale testing of
any process changes.

1.2 Trace processing

An automated trace-processing pipeline has been developed to process each sequence file (57). After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less than 1 in 1000 reads being less than 98% accurate (26). Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 113 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome.
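The discard rule described above (drop the entire read when it matches a contaminant) can be illustrated with a toy screen. This is only a sketch under stated assumptions: the text does not say how matches were computed, so exact shared k-mers stand in for a real alignment search, and the function names, the sequences, and the parameters `k` and `min_shared` are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, contaminants, k=20, min_shared=1):
    """Split reads into (kept, discarded) by shared k-mers with contaminants.

    Assumption: sharing at least `min_shared` exact k-mers counts as a
    "significant match"; a production pipeline would use an alignment tool.
    """
    contaminant_kmers = set()
    for seq in contaminants.values():
        contaminant_kmers |= kmers(seq, k)

    kept, discarded = {}, {}
    for name, seq in reads.items():
        shared = len(kmers(seq, k) & contaminant_kmers)
        if shared >= min_shared:
            discarded[name] = seq   # the entire read is dropped
        else:
            kept[name] = seq
    return kept, discarded


# Hypothetical sequences, for illustration only.
vector = "GATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGGCGTAATCATGG"
reads = {
    "clean": "ATGGCCAAGTTGACCAGTGCCGTTCCGGTGCTCACCGCGC",
    "dirty": "TTTTT" + vector[5:35] + "CCCCC",
}
kept, discarded = screen_reads(reads, {"vector": vector})
```

Discarding the whole read, rather than masking the matching part, keeps the logic simple and errs on the side of excluding foreign sequence from the assembly input.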



1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the genome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency.

Table 1. Celera-generated data input into assembly.

                                        Number of reads for different insert libraries
                            Individual  2 kbp        10 kbp       50 kbp       Total        Total number of base pairs

No. of sequencing reads     A           0            0            2,767,357    2,767,357    1,502,674,851
                            B           11,736,757   7,467,755    66,930       19,271,442   10,464,393,006
                            C           853,819      881,290      0            1,735,109    942,164,187
                            D           952,523      1,046,815    0            1,999,338    1,085,640,534
                            F           0            1,498,607    0            1,498,607    813,743,601
                            Total       13,543,099   10,894,467   2,834,287    27,271,853   14,808,616,179

Fold sequence coverage      A           0            0            0.52         0.52
(2.9-Gb genome)             B           2.20         1.40         0.01         3.61
                            C           0.16         0.17         0            0.32
                            D           0.18         0.20         0            0.37
                            F           0            0.28         0            0.28
                            Total       2.54         2.04         0.53         5.11

Fold clone coverage         A           0            0            18.39        18.39
                            B           2.96         11.26        0.44         14.67
                            C           0.22         1.33         0            1.54
                            D           0.24         1.58         0            1.82
                            F           0            2.26         0            2.26
                            Total       3.42         16.43        18.84        38.68

Insert size* (mean)         Average     1,951 bp     10,800 bp    50,715 bp
Insert size* (SD)           Average     6.10%        8.10%        14.90%
% Mates†                    Average     74.30        80.80        75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001
2 Genome Assembly Strategy and Characterization

Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger.

[Figure 2: flow diagram. Products, with responsible departments in brackets: Human Samples [Medical Affairs]; Tissue Samples [DNA Resources]; DNA/RNA [DNA Resources]; DNA/RNA (External) [DNA Resources]; Libraries [DNA Resources]; Fluorescently Labeled DNA [Pre-Sequencing Lab]; Trace Files [NT] [Sequencing Lab]; Trace Files [UNIX] [Sequencing Lab]; Trimmed Fragments [Content Systems]. Quality control steps: sample screening; size and concentration; titer and functional test; validate trace files; load QCDS quality info; vector and contaminant screening; monitor statistical summary data. Potential exit points are marked along the workflow.]

Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.


Shotgun sequence assembly is a classic example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Genome assembly algorithms developed for Drosophila have now been extended to assemble the ~25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds. This set of unincorporated reads is termed "chaff," and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector.
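The way a mate pair spanning two contigs implies a gap size can be shown with a little arithmetic. The sketch below is an illustration, not the authors' implementation: it assumes the mean insert size of the library is known, that one read lies on the left contig pointing toward the gap and its mate lies on the right contig pointing back, and that coordinates are 0-based. All function and parameter names are hypothetical.

```python
from statistics import mean


def gap_from_mate(insert_size, contig_a_len, read_a_start, read_b_end):
    """Gap implied by one mate pair spanning contig A (left) and contig B (right).

    read_a_start: start of the forward read on contig A.
    read_b_end:   exclusive end of the mate on contig B.
    The insert covers the tail of A, the gap, and the head of B.
    """
    span_on_a = contig_a_len - read_a_start
    span_on_b = read_b_end
    return insert_size - span_on_a - span_on_b


def estimate_gap(insert_size, contig_a_len, mates):
    """Average the gap implied by several (read_a_start, read_b_end) links."""
    return mean(gap_from_mate(insert_size, contig_a_len, a, b) for a, b in mates)


# Two hypothetical 2-kbp mate pairs linking a 5000-bp contig to its neighbor.
links = [(4200, 700), (4500, 1050)]
gap = estimate_gap(2000, 5000, links)
```

A negative result would suggest that the two contigs overlap rather than being separated by a gap. Averaging several links narrows the uncertainty that comes from the spread of insert sizes within a library.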




2.1 Assembly data sets

We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.1X coverage of the genome, and clone coverage was 3.42X, 16.40X, and 18.84X for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7X clone coverage.
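The two coverage measures defined above reduce to simple ratios, and the Table 1 totals can be used to check them. The following is a rough sketch under stated assumptions: the number of clones sequenced from both ends is approximated as reads × (% mates) / 2, which is an interpretation of the "% Mates" row rather than a formula given in the text, so the clone coverage comes out close to, but not exactly at, the tabulated values. Function and variable names are hypothetical.

```python
GENOME_BP = 2.9e9  # genome size assumed in the text


def sequence_coverage(total_bp, genome_bp=GENOME_BP):
    """Fold sequence coverage: sequenced bases divided by genome size."""
    return total_bp / genome_bp


def clone_coverage(n_reads, mate_fraction, insert_bp, genome_bp=GENOME_BP):
    """Fold clone coverage from clones that have sequence from both ends.

    Assumption: mated reads come in pairs, one pair per clone.
    """
    n_clones = n_reads * mate_fraction / 2
    return n_clones * insert_bp / genome_bp


# Totals from Table 1: (reads, fraction mated, mean insert size in bp).
libraries = {
    "2 kbp": (13_543_099, 0.743, 1_951),
    "10 kbp": (10_894_467, 0.808, 10_800),
    "50 kbp": (2_834_287, 0.756, 50_715),
}

seq_cov = sequence_coverage(14_808_616_179)
clone_cov = {name: clone_coverage(*row) for name, row in libraries.items()}

print(f"sequence coverage: {seq_cov:.2f}X")
for name, cov in clone_cov.items():
    print(f"clone coverage, {name} library: {cov:.2f}X")
```

The large-insert libraries contribute little sequence coverage but most of the clone coverage, which is why they matter for linking contigs over long distances.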
The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 1X. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 3X to 4X light-shotgun of each BAC clone.

[Figure 3: schematic. Labels: Mapped Scaffolds; Genome; Scaffold; Read pair (mates); Gap (mean and std. dev. known); Contig; Consensus; Reads (of several haplotypes); SNPs; BAC Fragments.]

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information.

We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Sequences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes [citation illegible].

2.2 Assembly strategies

Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow.

For the whole-genome assembly, the PFP data was first disassembled or "shredded" into a synthetic shotgun data set of 550-bp reads that form a perfect 2X covering of the bactigs. This resulted in 16.05 million "faux" reads that were sufficient to cover the genome 2.96X because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (8X), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location information was ignored because some BACs were not correctly placed on the PFP physical map and because we found strong evidence that at least 2.2% of the BACs contained sequence data that were not part of the given BAC (41), possibly as a result of sample-tracking errors (see below). In short, we performed a true, ab initio whole-genome assembly in which we



Table 2. GenBank data input into assembly.

Column group: Completion phase sequence

Centers: Washington University, USA; Baylor College of Medicine, USA; Production Sequencing Facility, DOE Joint Genome Institute, USA; The Institute of Physical and Chemical Research (RIKEN), Japan; Sanger Centre, UK; Others*; All centers combined†

Rows for each center: Number of accession records; Number of contigs; Total base pairs; Total vector masked (bp); Total contaminant masked (bp); Average contig length (bp)



2.825 
243.786 
194,490;158 
; 1.553.597 
'13.654.482 



6.533 
138,023 
1.083.848.245 
875.618 > 
4,417,055/); 



;798 

•-.:fv19/-- 
.•2.127-; 
1.195,732 

21,604 . 

22.469 ' 

562 - 

■;■ ■'i 0 I 
0 

• : 0 
0 

. . ,0 



7,853; 



3,232 
'61,812 
561,171.788; 
: 270.942 
1.476.141 



134.516 

.1,300 
.1.300 
164.214.395 
8.287 
'469.487 



P 

135 
7.052 



: 9.079 ' 

S 1.626 
44.861 
265.547.066. 
- 218.769. 
■i : 1.7^700;, 

5,919 

.2.043 , 
34.938 



^ .126.319 

.-.! :- -363 
.'363 
49.017.104 
.4.960 
,.485.137 

135,033 

754 
754 



8.680:214r i:> 294.249;6311 -^J' ^0.975.328 ; 



.22.644 
r 665.818 

1.231 

0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 

42 
5.978- 
5.564.879 
57.448 
575.366 

931 

3.021 
258.943 
209.930.983 
1.655.293 
14.918.135 



. 162.651 
. 4,642.372; 

8.422 

1.149 
25.772 
182.812.275 
203.792 
308.426 
7,093 

4.538 
74,324 
689.059.692 
427,326 
2.066.305 
9.271 

1.894 



7.274 
118,387 



80.867 

300 
300 

20,093.926 
2.371 
27.781 
66.978 

2,599' 
2.599 
246.118,000 
25.054 
374.561 
94.697 

3.458 

29!898 3,458 

283.358.877 246,474,157 



279.477 
1.616.665 

9.478 

21.015 
409.628 
3.360,047.574 
2.438.575 
16.311.664 



32.136 
1.791.849 

71.277 

9.137 

9.137 

835.722.268 
82,284 
3365.230 



T otHer cente. contHtu^.n. 7 t least 0.1^ o^^^aTn.S: ' ct^^^^^ 

Cenomanalyje Cesellschaft fuer Blotechnotoglsche f<''«^""?. School of Medicine: Lawrence 

ah^-A»demy-of.ScienceRjnst!tute_^ 

shredded into faux reads resulting in 2.96X coverage of the genome.



took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data.
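The shredding of bactigs into 550-bp faux reads at 2X can be pictured as two tilings offset by half a read length. The sketch below is one way to do this, not necessarily the scheme used by the authors: the function names are hypothetical, and the handling of the final partial window (one extra read placed flush with the end) is an assumption.

```python
def shred(sequence, read_len=550, fold=2):
    """Cut a sequence into overlapping windows giving `fold` coverage.

    Returns (start, read) tuples. Windows start every read_len // fold bases,
    so interior bases are covered `fold` times.
    """
    if len(sequence) <= read_len:
        return [(0, sequence)]
    step = read_len // fold
    last = len(sequence) - read_len
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)  # assumption: add a final read flush with the end
    return [(s, sequence[s:s + read_len]) for s in starts]


def depth_at(position, reads):
    """Number of faux reads covering a 0-based position."""
    return sum(start <= position < start + len(seq) for start, seq in reads)


# A hypothetical 2200-bp bactig.
bactig = "ACGT" * 550
faux_reads = shred(bactig)
```

With this tiling the first and last half-read of each bactig are covered only once, so the covering is exactly 2X in the interior only.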

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or "components" that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera's data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 5X Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly

The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (25).

The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched but can be part of an overlap that involves unscreened matching segments.

The Overlapper compares every read against every other read in search of complete end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match. Because all data are scrupulously vector-trimmed, the Overlapper can insist on complete overlap matches. Computing the set of all overlaps took roughly 10,000 CPU hours with a suite of four-processor Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time with 40 such machines operating in parallel.
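The overlap criterion stated for the Overlapper (at least 40 bp, no more than 6% differences) can be made concrete with a deliberately naive check. This sketch is an assumption-laden illustration, not the production algorithm: it handles only a suffix of one read against a prefix of another, counts substitutions only (no insertions or deletions), ignores the reverse strand, and tries every length instead of using an index. The names are hypothetical.

```python
import random


def best_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix(a)/prefix(b) overlap meeting the thresholds.

    Returns 0 when no overlap of at least min_len has a mismatch fraction
    of at most max_diff.
    """
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-olen:], b[:olen]))
        if mismatches <= max_diff * olen:
            return olen
    return 0


# Two reads sampled from the same hypothetical random sequence, 50 bp apart.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))
read_a, read_b = genome[0:120], genome[70:190]
overlap = best_overlap(read_a, read_b)
```

Comparing every read with every other read this way is quadratic in the number of reads, which gives a sense of why the real computation took on the order of 10,000 CPU hours.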

Every overlap computed above is statistically a 1-in-10^[exponent illegible] event and thus not a coincidental event. What makes assembly combinatorially difficult is that while many overlaps are actually sampled from overlapping regions of the genome, and thus imply that the sequence reads should be assembled together, even more overlaps are actually from two distinct copies of a low-copy repeated element not screened above, thus constituting an error if put together. We call the former "true overlaps" and the latter "repeat-induced overlaps." The assembler must avoid choosing repeat-induced overlaps, especially early in the process.

We achieve this objective in the Unitigger. We first find all assemblies of reads that appear to be uncontested with respect to all other reads. We call the contigs formed from these subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are the uncontested interval subgraphs of the graph of all overlaps (42). Unfortunately, although empirically many of these assemblies are correct (and thus involve only true overlaps), some are in fact collections of reads from several copies of a repetitive element that have been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified because their average coverage depth is too high to be consistent with the overall level of sequence coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very likely to be correctly assembled, of which we select those that will consistently scaffold (see below), and thus are again almost certain to be correct. We call the union of these two sets U-unitigs. Empirically, we found from a 6X simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are >2 kbp long. We are further able to identify the boundary of the start of a repetitive element at the ends of a U-unitig and leverage this so that U-unitigs span more than 93% of all



singly interspersed Alu elements and other 100- to 400-bp repetitive segments.

The result of running the Unitigger was thus a set of correctly assembled subcontigs covering an estimated 73.6% of the human genome. The Scaffolder then proceeded to use mate-pair information to link these together into scaffolds. When there are two or more mate pairs that imply that a given pair of U-unitigs are at a certain distance and orientation with respect to each other, the probability of this being wrong is again roughly 1 in 10^[exponent illegible], assuming that mate pairs are false less than 2% of the time. Thus, one can with high confidence link together all U-unitigs that are linked by at least two 2- or 10-kbp mate pairs, producing intermediate-sized scaffolds that are then recursively linked together by confirming 50-kbp mate pairs and BAC end sequences. This process yielded scaffolds that are on the order of megabase pairs in size with gaps between their contigs that generally correspond to repetitive elements and occasionally to small sequencing gaps. These scaffolds reconstruct the majority of the unique sequence within a genome.

For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy where each stage was progressively more aggressive and thus more likely to make a mistake. For the human assembly, we continued to use the first "Rocks" substage where all unitigs with a good, but not definitive, discriminator score are placed in a scaffold gap. This was done with the condition that two or more mate pairs with one of their reads already in the scaffold unambiguously place the unitig in the given gap. We estimate the probability of inserting a unitig into an incorrect gap with this strategy to be less than 10^[exponent illegible] based on a probabilistic analysis.

We revised the ensuing "Stones" substage of the human assembly, making it more like the mechanism suggested in our earlier work (43). For each gap, every read R that is placed in the gap by virtue of its mated pair M being in a contig of the scaffold and implying R's placement is collected. Celera's mate-pairing information is correct more than 99% of the time. Thus, almost every, but not all, of the reads in the set belong in the gap, and when a read does not belong it rarely agrees with the remainder of the reads. Therefore, we simply assemble this set of reads within the gap, eliminating any reads that conflict with the assembly. This operation proved much more reliable than the one it replaced for the Drosophila assembly; in the assembly of a simulated shotgun data set of human chromo-

[Figure 4: flow diagram. Labels: 5.11X Celera Reads; 39X mate pairs; Public Bactigs (from 33,421 BACs); Bactigs & Celera pairs (binned by BAC); Combining Assembler; Components; WGA Assembly; CSA Assembly.]

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation process performing the function indicated by its label, with the labels on arcs between ovals describing the nature of the objects produced and/or consumed by a process. This figure summarizes the discussion in the text that defines the terms and phrases used.
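The Unitigger text describes a discriminator that returns the log of the odds that a unitig is unique DNA rather than a collapsed repeat, based on its coverage depth. The text does not give the formula, so the following is only one plausible construction, assuming read start positions arrive as a Poisson process and comparing a one-copy model against a two-copy model. With expected count λ = unitig length × total reads / genome length and k reads observed, the likelihood ratio is e^λ / 2^k, so the log-odds is λ − k ln 2. All names and example numbers below are hypothetical.

```python
import math


def unique_log_odds(unitig_len, n_reads, total_reads, genome_len):
    """Natural-log odds that a unitig is single-copy rather than two-copy.

    Assumes Poisson read arrivals: rate lambda for unique DNA and
    2 * lambda for a two-copy repeat collapsed into one unitig.
    """
    expected = unitig_len * total_reads / genome_len  # lambda
    return expected - n_reads * math.log(2)


TOTAL_READS = 27_271_853   # Celera reads, Table 1
GENOME_LEN = 2.9e9         # genome size assumed in the text

# A 5000-bp unitig with about the expected number of reads, and one with twice that.
typical = unique_log_odds(5000, 47, TOTAL_READS, GENOME_LEN)
collapsed = unique_log_odds(5000, 94, TOTAL_READS, GENOME_LEN)
```

Positive scores favor unique DNA and negative scores favor a collapsed repeat. A stringent threshold well above zero would correspond to the "certain" subset described in the text, and a lower threshold to the "very likely" subset.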
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AppSaial«1(M120,09S 
VytaOceetal. LEX-0282OSA 
Novel Human Alpha Macroglobulin Family Proteins and Polynucleotides Encoding the Same



Query= SEQ ID NO: 3 

(4287 letters) 



Sequences producing significant alignments:                 Score    E
                                                            (bits) Value

AL590428.7.1.163577                                           460  e-126
AL591480.8.1.91419                                            349  9e-93

>AL590428.7.1.163577
Length = 163577

Score = 460 bits (232), Expect = e-126
Identities = 232/232 (100%)
Strand = Plus / Plus



Query: 277 ctacctctgaacagtgcagatgagatttatgagctacgtgtaaccggacgtacccaggat 336 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 80550 ctacctctgaacagtgcagatgagatttatgagctacgtgtaaccggacgtacccaggat 80609 
Query: 337 gagattttattctctaatagtacccgcttatcatttgagaccaagagaatatctgtcttc 396 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 80610 gagattttattctctaatagtacccgcttatcatttgagaccaagagaatatctgtcttc 80669 
Query: 397 attcaaacagacaaggccttatacaagccaaagcaagaagtgaagtttcgcattgttaca 456 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 80670 attcaaacagacaaggccttatacaagccaaagcaagaagtgaagtttcgcattgttaca 80729 



Query: 457 ctcttctcagattttaagccttacaaaacctctttaaacattctcattaagg 508 

liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 80730 ctcttctcagattttaagccttacaaaacctctttaaacattctcattaagg 80781 



Score = 456 bits (230), Expect = e-125 
Identities = 230/230 (100%) 
Strand = Plus / Plus 



Query: 2960 ggttgtcagcttttgttttaagatgtttccttgaagccgatccttacatagatattgatc 3019 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 157049 ggttgtcagcttttgttttaagatgtttccttgaagccgatccttacatagatattgatc 157108 
Query: 3020 agaatgtgttacacagaacatacacttggcttaaaggacatcagaaatccaacggtgaat 3079 

IIIMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIillllll 

Sbjct : 157109 agaatgtgttacacagaacatacacttggcttaaaggacatcagaaatccaacggtgaat 157168 



Query: 3080 tttgggatccaggaagagtgattcatagtgagcttcaaggtggcaataaaagtccagtaa 3139 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 157169 tttgggatccaggaagagtgattcatagtgagcttcaaggtggcaataaaagtccagtaa 157228 



Query: 3140 cacttacagcctatattgtaacttctctcctgggatatagaaagtatcag 3189

||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 157229 cacttacagcctatattgtaacttctctcctgggatatagaaagtatcag 157278



Score = 448 bits (226), Expect = e-122 
Identities = 226/226 (100%) 
Strand = Plus / Plus 



Query: 1108 gtgaaggtaactcgtgctgatggcaaccaactgactcttgaagaaagaagaaataatgta 1167 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 116136 gtgaaggtaactcgtgctgatggcaaccaactgactcttgaagaaagaagaaataatgta 116195 

Query: 1168 gtcataacagtgacacagagaaactatactgagtactggagcggatctaacagtggaaat 1227 

IIIIIIIIIIIIIIIIIIMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 

Sbjct : 116196 gtcataacagtgacacagagaaactatactgagtactggagcggatctaacagtggaaat 116255 

Query: 1228 cagaaaatggaagctgttcagaaaataaattatactgtcccccaaagtggaacttttaag 1287 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 116256 cagaaaatggaagctgttcagaaaataaattatactgtcccccaaagtggaacttttaag 116315 



Query: 1288 attgaattcccaatcctggaggattccagtgagctacagttgaagg 1333 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 116316 attgaattcccaatcctggaggattccagtgagctacagttgaagg 116361 



Score = 432 bits (218), Expect = e-118 
Identities = 221/222 (99%) 
Strand = Plus / Plus 



Query: 2336 aggttaaggtaatcattgagaaaagtgacaaatttgatattctaatgacttcaagtgaaa 2395 

llllllllllllllllllllllllllllllllllllllllllllllllllllll Mill 

Sbjct : 137438 aggttaaggtaatcattgagaaaagtgacaaatttgatattctaatgacttcaaatgaaa 137497 
Query: 2396 taaatgccacaggccaccagcagacccttctggttcccagtgaggatggggcaactgttc 2455 

IIIIIIIIIIIIIIIIIMIIIIIIIIIIIIIIillllllllllllilllllllllllll 

Sbjct : 137498 taaatgccacaggccaccagcagacccttctggttcccagtgaggatggggcaactgttc 137557 



Query: 2456 tttttcccatcaggccaacacatctgggagaaattcctatcacagtcacagctctttcac 2515 

IMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMIIIMIIIIIIIIIII 

Sbjct: 137558 tttttcccatcaggccaacacatctgggagaaattcctatcacagtcacagctctttcac 137617 



Query: 2516 ccactgcttctgatgctgtcacccagatgattttagtaaagg 2557

||||||||||||||||||||||||||||||||||||||||||

Sbjct: 137618 ccactgcttctgatgctgtcacccagatgattttagtaaagg 137659



Score = 385 bits (194), Expect = e-103 
Identities = 194/194 (100%) 
Strand = Plus / Plus 



Query: 3354 aggtggcatgcaattctgggtgtcatcagagtccaaactttctgactcctggcagccacg 3413 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct ; 160188 aggtggcatgcaattctgggtgtcatcagagtccaaactttctgactcctggcagccacg 160247 

Query: 3414 ctccctggatattgaagttgcagcctatgcactgctctcacacttcttacaatttcagac 3473 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 160248 ctccctggatattgaagttgcagcctatgcactgctctcacacttcttacaatttcagac 160307 

Query: 3474 ttctgagggaatcccaattatgaggtggctaagcaggcaaagaaatagcttgggtggttt 3533 

IIIIIIIIIIIIIMIIIIIIMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMIIIIM 

Sbjct : 160308 ttctgagggaatcccaattatgaggtggctaagcaggcaaagaaatagcttgggtggttt 160367 



Query: 3534 tgcatctactcagg 3547 

MMIMIIIIIII 

Sbjct: 160368 tgcatctactcagg 160381 



Score = 357 bits (180), Expect = 4e-95
Identities = 180/180 (100%)
Strand = Plus / Plus



Query: 2701 ggagatgttcttggtccttccatcaatggcttagcctcattgattcggatgccttatggc 2760 

MIIIIIIIIIMIIIIIIIIIMIIIMIIIIIIIIIIIIIIIMIIIM.IMIIIIM 

Sbjct : 142831 ggagatgttcttggtccttccatcaatggcttagcctcattgattcggatgccttatggc 142890 
Query: 2761 tgtggtgaacagaacatgataaattttgctccaaatatttacattttggattatctgact 2820 

IIMIIIIIIIIIIIIIIIIIilllllllllllllllllllllllllllllllllMIII 

Sbjct : 142891 tgtggtgaacagaacatgataaattttgctccaaatatttacattttggattatctgact 142950 
Query: 2821 aaaaagaaacaactgacagataatttgaaagaaaaagctctttcatttatgaggcaaggt 2880 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 142951 aaaaagaaacaactgacagataatttgaaagaaaaagctctttcatttatgaggcaaggt 143010 



Score = 353 bits (178), Expect = 6e-94 
Identities = 178/178 (100%) 
Strand = Plus / Plus 



Query: 1497 ggtagtatccaggggacagttggtggctgtaggaaaacaaaattcaacaatgttctcttt 1556

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 118260 ggtagtatccaggggacagttggtggctgtaggaaaacaaaattcaacaatgttctcttt 118319



Query: 1557 aacaccagaaaattcttggactccaaaagcctgtgtaattgtgtattatattgaagatga 1616 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 118320 aacaccagaaaattcttggactccaaaagcctgtgtaattgtgtattatattgaagatga 118379 
Query: 1617 tggggaaattataagtgatgttctaaaaattcctgttcagcttgtttttaaaaataag 1674 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 118380 tggggaaattataagtgatgttctaaaaattcctgttcagcttgtttttaaaaataag 118437 



Score = 347 bits (175), Expect = 4e-92 
Identities = 175/175 (100%) 
Strand = Plus / Plus 



Query: 74 ggcctcggtttctggtgacagccccagggatcatcaggcccggaggaaatgtgactattg 133 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 47605 ggcctcggtttctggtgacagccccagggatcatcaggcccggaggaaatgtgactattg 47664 
Query: 134 gggtggagcttctggaacactgcccttcacaggtgactgtgaaggcggagctgctcaaga 193 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 47665 gggtggagcttctggaacactgcccttcacaggtgactgtgaaggcggagctgctcaaga 47724 



Query: 194 cagcatcaaacctcactgtctctgtcctggaagcagaaggagtctttgaaaaagg 248 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 47725 cagcatcaaacctcactgtctctgtcctggaagcagaaggagtctttgaaaaagg 47779 



Score = 337 bits (170), Expect = 3e-89
Identities = 170/170 (100%)
Strand = Plus / Plus



Query: 3188 agcctaacattgatgtgcaagagtctatccattttttggagtctgaattcagtagaggaa 3247 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 158287 agcctaacattgatgtgcaagagtctatccattttttggagtctgaattcagtagaggaa 158346 

Query: 3248 tttcagacaattatactctagcccttataacttatgcattgtcatcagtggggagtccta 3307 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct : 158347 tttcagacaattatactctagcccttataacttatgcattgtcatcagtggggagtccta 158406 



Query: 3308 aagcgaaggaagctttgaatatgctgacttggagagcagaacaagaaggt 3357 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 158407 aagcgaaggaagctttgaatatgctgacttggagagcagaacaagaaggt 158456 



Score = 313 bits (158), Expect = 5e-82 
Identities = 158/158 (100%) 
Strand = Plus / Plus 



Query: 1673 agataaagctatattggagtaaagtgaaagctgaaccatctgagaaagtctctcttagga 1732 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 121633 agataaagctatattggagtaaagtgaaagctgaaccatctgagaaagtctctcttagga 121692 
Query: 1733 tctctgtgacacagcctgactccatagttgggattgtagctgttgacaaaagtgtgaatc 1792 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 121693 tctctgtgacacagcctgactccatagttgggattgtagctgttgacaaaagtgtgaatc 121752 



Query: 1793 tgatgaatgcctctaatgatattacaatggaaaatgtg 1830 

||||||||||||||||||||||||||||||||||||||

Sbjct: 121753 tgatgaatgcctctaatgatattacaatggaaaatgtg 121790 



Score = 293 bits (148), Expect = 5e-76
Identities = 148/148 (100%)
Strand = Plus / Plus



Query: 2555 aggctgaaggaatagaaaaatcatattcacaatccatcttattagacttgactgacaata 2614 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 138672 aggctgaaggaatagaaaaatcatattcacaatccatcttattagacttgactgacaata 138731 
Query: 2615 ggctacagagtaccctgaaaactttgagtttctcatttcctcctaatacagtgactggca 2674

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 138732 ggctacagagtaccctgaaaactttgagtttctcatttcctcctaatacagtgactggca 138791 



Query: 2675 gtgaaagagttcagatcactgcaattgg 2702 

||||||||||||||||||||||||||||

Sbjct: 138792 gtgaaagagttcagatcactgcaattgg 138819 



Score = 289 bits (146), Expect = 7e-75
Identities = 146/146 (100%)
Strand = Plus / Plus



Query: 854 agataaatggatctgcaaacttctcttttaatgatgaagagatgaaaaatgtaatggatt 913 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct : 112945 agataaatggatctgcaaacttctcttttaatgatgaagagatgaaaaatgtaatggatt 113004 



Query: 914 cttcaaatggactttctgaatacctggatctatcttcccctggaccagtagaaattttaa 973 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 113005 cttcaaatggactttctgaatacctggatctatcttcccctggaccagtagaaattttaa 113064 



Query: 974 ccacagtgacagaatcagttacaggt 999 

iiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 113065 ccacagtgacagaatcagttacaggt 113090 



Score = 281 bits (142), Expect = 2e-72 
Identities = 142/142 (100%) 
Strand = Plus / Plus 

Query: 1964 atgacaatgcagaatatgctgagaggtttatggaggaaaatgaaggacatattgtagata 2023 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct : 132820 atgacaatgcagaatatgctgagaggtttatggaggaaaatgaaggacatattgtagata 132879 
Query: 2024 ttcatgacttttctttgggtagcagtccacatgtccgaaagcattttccagagacttgga 2083 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 132880 ttcatgacttttctttgggtagcagtccacatgtccgaaagcattttccagagacttgga 132939 
Query: 2084 tttggctagacaccaacatggg 2105 

||||||||||||||||||||||

Sbjct: 132940 tttggctagacaccaacatggg 132961 



Score = 256 bits (129), Expect = 1e-64
Identities = 129/129 (100%) 
Strand = Plus / Plus 

Query: 506 aggaccccaaatcaaatttgatccaacagtggttgtcacaacaaagtgatcttggagtca 565 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 86587 aggaccccaaatcaaatttgatccaacagtggttgtcacaacaaagtgatcttggagtca 86646 
Query: 566 tttccaaaacttttcagctatcttcccatccaatacttggtgactggtctattcaagttc 625 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 86647 tttccaaaacttttcagctatcttcccatccaatacttggtgactggtctattcaagttc 86706 
Query: 626 aagtgaatg 634 

|||||||||

Sbjct: 86707 aagtgaatg 86715 



Score = 236 bits (119), Expect = 9e-59 
Identities = 119/119 (100%) 
Strand = Plus / Plus 

Query: 2105 gttacaggatttaccaagaatttgaagtaactgtacctgattctatcacttcttgggtgg 2164 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct : 133912 gttacaggatttaccaagaatttgaagtaactgtacctgattctatcacttcttgggtgg 133971 



Query: 2165 ctactggttttgtgatctctgaggacctgggtcttggactaacaactactccagtggag 2223 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 133972 ctactggttttgtgatctctgaggacctgggtcttggactaacaactactccagtggag 134030 



Score = 234 bits (118), Expect = 4e-58 
Identities = 118/118 (100%) 
Strand = Plus / Plus 

Query: 2222 agctccaagccttccaaccatttttcatttttttgaatcttccctactctgttatcagag 2281 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 135568 agctccaagccttccaaccatttttcatttttttgaatcttccctactctgttatcagag 135627 
Query: 2282 gtgaagaatttgctttggaaataactatattcaattatttgaaagatgccactgaggt 2339 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 135628 gtgaagaatttgctttggaaataactatattcaattatttgaaagatgccactgaggt 135685 



Score = 226 bits (114), Expect = 9e-56 
Identities = 114/114 (100%) 
Strand = Plus / Plus 

Query: 996 aggtatttcaagaaatgtaagcactaatgtgttcttcaagcaacatgattacatcattga 1055 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct : 113780 aggtatttcaagaaatgtaagcactaatgtgttcttcaagcaacatgattacatcattga 113839 
Query: 1056 gttttttgattatactactgtcttgaagccatctctcaacttcacagccactgt 1109 

||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 113840 gttttttgattatactactgtcttgaagccatctctcaacttcacagccactgt 113893 



Score = 214 bits (108), Expect = 3e-52 
Identities = 108/108 (100%) 
Strand = Plus / Plus 

Query: 3544 caggataccactgtggctttaaaggctctgtctgaatttgcagccctaatgaatacagaa 3603 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 161195 caggataccactgtggctttaaaggctctgtctgaatttgcagccctaatgaatacagaa 161254 
Query: 3604 aggacaaatatccaagtgaccgtgacggggcctagctcaccaagtcct 3651 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 161255 aggacaaatatccaagtgaccgtgacggggcctagctcaccaagtcct 161302 



Score = 210 bits (106), Expect = 5e-51 
Identities = 106/106 (100%) 
Strand = Plus / Plus 



Query: 1331 aggcctatttccttggtagtaaaagtagcatggcagttcatagtctgtttaagtctccta 1390 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 116963 aggcctatttccttggtagtaaaagtagcatggcagttcatagtctgtttaagtctccta 117022 



Query: 1391 gtaagacatacatccaactaaaaacaagagatgaaaatataaaggt 1436 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 117023 gtaagacatacatccaactaaaaacaagagatgaaaatataaaggt 117068 



Score = 192 bits (97), Expect = 1e-45
Identities = 97/97 (100%) 
Strand = Plus / Plus 



Query: 759 gtatacatatgggaagccagtgaaaggagacgtaacgcttacatttttacctttatcctt 818

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct : 112590 gtatacatatgggaagccagtgaaaggagacgtaacgcttacatttttacctttatcctt 112649 



Query: 819 ttggggaaagaagaaaaatattacaaaaacatttaag 855

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 112650 ttggggaaagaagaaaaatattacaaaaacatttaag 112686 



Score = 176 bits (89), Expect = 8e-41 

Identities = 89/89 (100%) 
Strand = Plus / Plus 



Query: 673 gtattaccaaaatttgaagtgactttgcagacaccattatattgttctatgaattctaag 732 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 109149 gtattaccaaaatttgaagtgactttgcagacaccattatattgttctatgaattctaag 109208 



Query: 733 catttaaatggtaccatcacggcaaagta 761 

|||||||||||||||||||||||||||||

Sbjct: 109209 catttaaatggtaccatcacggcaaagta 109237 



Score = 170 bits (86), Expect = 5e-39 
Identities = 86/86 (100%) 
Strand = Plus / Plus 



Query: 2877 aggttaccagagagaacttctctatcagagggaagatggctctttcagtgcttttgggaa 2936 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct : 153424 aggttaccagagagaacttctctatcagagggaagatggctctttcagtgcttttgggaa 153483 



Query: 2937 ttatgacccttctgggagcacttggt 2962 

||||||||||||||||||||||||||
Sbjct: 153484 ttatgacccttctgggagcacttggt 153509 



Score = 153 bits (77), Expect = 1e-33
Identities = 80/81 (98%) 
Strand = Plus / Plus 

Query: 1828 gtggtccatgagttggaactttataacacaggatattatttaggcatgttcatgaattct 1887 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 130630 gtggtccatgagttggaactttataacacaggatattatttaggcatgttcatgaattct 130689 
Query: 1888 tttgcagtctttcaggaatgt 1908 

|||||||||||||||| ||||

Sbjct: 130690 tttgcagtctttcaggtatgt 130710 



Score = 149 bits (75), Expect = 2e-32 
Identities = 75/75 (100%) 
Strand = Plus / Plus 

Query: 1 atgcagggcccaccgctcctgaccgccgcccacctcctctgcgtgtgcaccgccgcgctg 60 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 46422 atgcagggcccaccgctcctgaccgccgcccacctcctctgcgtgtgcaccgccgcgctg 46481 
Query: 61 gccgtggctcccggg 75 

iiiiiiiiiiiiiii 

Sbjct: 46482 gccgtggctcccggg 46496 



Score = 135 bits (68), Expect = 3e-28 
Identities = 68/68 (100%) 
Strand = Plus / Plus 

Query: 1433 aggtgggatcgccttttgagttggtggttagtggcaacaaacgattgaaggagttaagct 1492 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 117152 aggtgggatcgccttttgagttggtggttagtggcaacaaacgattgaaggagttaagct 117211 
Query: 1493 atatggta 1500 

||||||||

Sbjct: 117212 atatggta 117219 



Score = 125 bits (63), Expect = 2e-25 
Identities = 63/63 (100%) 
Strand = Plus / Plus 



Query: 1901 aggaatgtggactctgggtattgacagatgcaaacctcacgaaggattatattgatggtg 1960 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 131463 aggaatgtggactctgggtattgacagatgcaaacctcacgaaggattatattgatggtg 131522 

Query: 1961 ttt 1963 
III 

Sbjct: 131523 ttt 131525 



Score = 123 bits (62), Expect = 1e-24
Identities = 65/66 (98%) 
Strand = Plus / Plus 

Query: 3652 cttgctgtggtacagccaatggcagttaatatttccgcaaatggttttggatttgctatt 3711 

iiiiiiiiiiiiiiiiiii iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 162411 cttgctgtggtacagccaacggcagttaatatttccgcaaatggttttggatttgctatt 162470 



Query: 3712 tgtcag 3717 

||||||

Sbjct: 162471 tgtcag 162476 



Score = 71.9 bits (36), Expect = 3e-09 
Identities = 39/40 (97%) 
Strand = Plus / Plus 



Query: 634 gaccagacatattatcaatcatttcaggtttcagaatatg 673 

||||||||||| ||||||||||||||||||||||||||||

Sbjct: 106849 gaccagacatactatcaatcatttcaggtttcagaatatg 106888 



Score = 61.9 bits (31), Expect = 3e-06 
Identities = 31/31 (100%) 
Strand = Plus / Plus 



Query: 246 aggctcttttaagacacttactcttccatca 276 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 73455 aggctcttttaagacacttactcttccatca 73485 



>AL591480.8.1.91419

Length = 91419 

Score = 349 bits (176), Expect = 9e-93 
Identities = 176/176 (100%) 
Strand = Plus / Plus 

Query: 4112 ggagacaggcggtgagaagttacaactctgaagtgaagctgtcctcctgtgacctttgca 4171 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 12087 ggagacaggcggtgagaagttacaactctgaagtgaagctgtcctcctgtgacctttgca 12146 
Query: 4172 gtgatgtccagggctgccgtccttgtgaggatggagcttcaggctcccatcatcactctt 4231 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 12147 gtgatgtccagggctgccgtccttgtgaggatggagcttcaggctcccatcatcactctt 12206 
Query: 4232 cagtcatttttattttctgtttcaagcttctgtactttatggaactttggctgtga 4287 

||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 12207 cagtcatttttattttctgtttcaagcttctgtactttatggaactttggctgtga 12262 



Score = 305 bits (154), Expect = 1e-79
Identities = 154/154 (100%) 
Strand = Plus / Plus 



Query: 3859 agcttttcgggcccgggtaggagtggcatggctcttatggaagttaacctattaagtggc 3918 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 7015 agcttttcgggcccgggtaggagtggcatggctcttatggaagttaacctattaagtggc 7074 
Query: 3919 tttatggtgccttcagaagcaatttctctgagcgagacagtgaagaaagtggaatatgat 3978 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 7075 tttatggtgccttcagaagcaatttctctgagcgagacagtgaagaaagtggaatatgat 7134 
Query: 3979 catggaaaactcaacctctatttagattctgtaa 4012 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 7135 catggaaaactcaacctctatttagattctgtaa 7168 



Score = 289 bits (146), Expect = 7e-75 
Identities = 146/146 (100%) 
Strand = Plus / Plus 

Query: 3715 cagctcaatgttgtatataatgtgaaggcttctgggtcttctagaagacgaagatctatc 3774 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 3607 cagctcaatgttgtatataatgtgaaggcttctgggtcttctagaagacgaagatctatc 3666 



Query: 3775 caaaatcaagaagcctttgatttagatgttgctgtaaaagaaaataaagatgatctcaat 3834 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 3667 caaaatcaagaagcctttgatttagatgttgctgtaaaagaaaataaagatgatctcaat 3726 

Query: 3835 catgtggatttgaatgtgtgtacaag 3860 

llllllllllllllllllllllllll 
Sbjct: 3727 catgtggatttgaatgtgtgtacaag 3752 



Score = 206 bits (104), Expect = 8e-50 
Identities = 104/104 (100%) 
Strand = Plus / Plus 



Query: 4009 gtaaatgaaacccagttttgtgttaatattcctgctgtgagaaactttaaagtttcaaat 4068 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 9090 gtaaatgaaacccagttttgtgttaatattcctgctgtgagaaactttaaagtttcaaat 9149 
Query: 4069 acccaagatgcttcagtgtccatagtggattactatgagccaag 4112 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 9150 acccaagatgcttcagtgtccatagtggattactatgagccaag 9193 



Score = 131 bits (66), Expect = 4e-27 
Identities = 66/66 (100%) 
Strand = Plus / Plus 



Query: 3652 cttgctgtggtacagccaatggcagttaatatttccgcaaatggttttggatttgctatt 3711 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct : 834 cttgctgtggtacagccaatggcagttaatatttccgcaaatggttttggatttgctatt 893 

Query: 3712 tgtcag 3717 
llllll 

Sbjct: 894 tgtcag 899 
LOCUS       AL590428   163577 bp    DNA     linear   PRI 31-JUL-2001
DEFINITION  Human DNA sequence from clone RP11-553A21 on chromosome 6, complete
            sequence.
ACCESSION   AL590428 AC026605
VERSION     AL590428.7  GI:15072593
KEYWORDS    HTG.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 163577)
  AUTHORS   Chapman,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (31-JUL-2001) Sanger Centre, Hinxton, Cambridgeshire,
            CB10 1SA, UK. E-mail enquiries: humquery@sanger.ac.uk Clone
            requests: clonerequest@sanger.ac.uk

COMMENT     On Aug 1, 2001 this sequence version replaced gi:15021177.
            During sequence assembly data is compared from overlapping clones.
            Where differences are found these are annotated as variations
            together with a note of the overlapping clone name. Note that the
            variation annotation may not be found in the sequence submission
            corresponding to the overlapping clone, as we submit sequences with
            only a small overlap as described above.

            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one plasmid subclone or more than one M13 subclone; and the
            assembly was confirmed by restriction digest. The following
            abbreviations are used to associate primary accession numbers given
            in the feature table with their source databases: Em:, EMBL; Sw:,
            SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information on the WORMPEP
            database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome 6, constructed by the Sanger Centre Chromosome 6 Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/Chr6

            RP11-553A21 is from the library RPCI-11.2 constructed by the group
            of Pieter de Jong. For further details see
            http://www.chori.org/bacpac/home.htm
            VECTOR: pBACe3.6

            IMPORTANT: This sequence is not the entire insert of clone
            RP11-553A21. It may be shorter because we sequence overlapping
            sections only once, except for a 100 base overlap.
            The true right end of clone RP11-553A21 is at 163577 in this
            sequence. The true left end of clone RP11-525G3 is at 88067 in this
            sequence. The true right end of clone RP3-397H23 is at 2000 in this



http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=15072593
LOCUS       AL591480   91419 bp    DNA     linear   PRI 26-JUL-2001
DEFINITION  Human DNA sequence from clone RP11-525G3 on chromosome 6, complete
            sequence.
ACCESSION   AL591480
VERSION     AL591480.8  GI:15026959
KEYWORDS    HTG.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 91419)
  AUTHORS   Almeida,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-JUL-2001) Sanger Centre, Hinxton, Cambridgeshire,
            CB10 1SA, UK. E-mail enquiries: humquery@sanger.ac.uk Clone
            requests: clonerequest@sanger.ac.uk

COMMENT     On Jul 27, 2001 this sequence version replaced gi:14586293.
            During sequence assembly data is compared from overlapping clones.
            Where differences are found these are annotated as variations
            together with a note of the overlapping clone name. Note that the
            variation annotation may not be found in the sequence submission
            corresponding to the overlapping clone, as we submit sequences with
            only a small overlap as described above.

            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one plasmid subclone or more than one M13 subclone; and the
            assembly was confirmed by restriction digest. The following
            abbreviations are used to associate primary accession numbers given
            in the feature table with their source databases: Em:, EMBL; Sw:,
            SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information on the WORMPEP
            database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome 6, constructed by the Sanger Centre Chromosome 6 Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/Chr6

            RP11-525G3 is from the library RPCI-11.2 constructed by the group
            of Pieter de Jong. For further details see
            http://www.chori.org/bacpac/home.htm
            VECTOR: pBACe3.6

            IMPORTANT: This sequence is not the entire insert of clone
            RP11-525G3. It may be shorter because we sequence overlapping
            sections only once, except for a 100 base overlap.
            The true right end of clone RP11-525G3 is at 91419 in this
            sequence. The true right end of clone RP11-553A21 is at 2000 in
            this sequence.



http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=15026959